[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79540609 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -444,7 +444,7 @@ class SessionCatalogSuite extends SparkFunSuite { assert(!catalog.tableExists(TableIdentifier("view1", Some("default")))) } - test("getTableMetadata on temporary views") { + test("getTableMetadata and getTempViewOrPermanentTableMetadata on temporary views") { --- End diff -- looks like it's unnecessary to test `getTableMetadata` on temporary views, let's just test `getTempViewOrPermanentTableMetadata` here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15135 I understand the reasons why you want to add this -- but I feel this is too esoteric and if we add this one, there are also a lot of other cases that can be added and I don't know where we would stop.
[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79540494 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -282,6 +271,24 @@ class SessionCatalog( } /** + * Retrieve the metadata of an existing temporary view or permanent table/view. + * If the temporary view does not exist, tries to get the metadata of an existing permanent + * table/view. If no database is specified, assume the table/view is in the current database. + * If the specified table/view is not found in the database then a [[NoSuchTableException]] is + * thrown. + */ + def getTempViewOrPermanentTableMetadata(name: TableIdentifier): CatalogTable = synchronized { --- End diff -- it should just take a string.
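The lookup order described in the quoted doc comment — check the temporary views first, then fall back to the permanent catalog with the current database as the default — can be sketched with toy dictionaries. This is an illustrative sketch only (hypothetical names and structures, not the SessionCatalog API):

```python
class NoSuchTableException(Exception):
    """Raised when neither a temp view nor a catalog entry matches."""

# Toy stand-ins for the session's temp-view registry and the permanent catalog.
temp_views = {"v1": {"name": "v1", "kind": "temp view"}}
catalog = {("default", "t1"): {"name": "t1", "kind": "table"}}

def get_temp_view_or_permanent_table_metadata(name, db=None, current_db="default"):
    # Temporary views shadow permanent tables only when no database is given.
    if db is None and name in temp_views:
        return temp_views[name]
    # Otherwise resolve against the permanent catalog, defaulting the database.
    key = (db or current_db, name)
    if key in catalog:
        return catalog[key]
    raise NoSuchTableException(key)

print(get_temp_view_or_permanent_table_metadata("v1")["kind"])  # -> temp view
print(get_temp_view_or_permanent_table_metadata("t1")["kind"])  # -> table
```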
[GitHub] spark issue #14959: [SPARK-17387][PYSPARK] Creating SparkContext() from pyth...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/14959 ``` The internal SparkConf of the context will not be the same instance as conf. ``` This is the existing implementation, where python differs from scala, but I think it is correct. I guess the reason why, in scala, the internal SparkConf of the SparkContext is not the same instance as conf is to ensure that changing the SparkConf after the SparkContext is created does not take effect. pyspark is the same in this respect: although in pyspark the internal SparkConf of the SparkContext is the same instance as conf, changing conf after the SparkContext is created does not take effect, as that is guaranteed on the scala side.
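The Scala-side behavior described above amounts to a defensive copy: the context snapshots the configuration at construction time, so later mutations of the caller's conf have no effect. A minimal plain-Python sketch of that pattern (toy names, not the pyspark API):

```python
class Context:
    """Toy stand-in for SparkContext: snapshots its config at construction."""

    def __init__(self, conf):
        # Defensive copy: the internal conf is a separate instance,
        # so mutating the caller's conf afterwards has no effect here.
        self._conf = dict(conf)

    def get(self, key):
        return self._conf.get(key)

conf = {"spark.app.name": "demo"}
ctx = Context(conf)
conf["spark.app.name"] = "changed-later"  # mutation after construction
print(ctx.get("spark.app.name"))  # -> demo
```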
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65631/ Test PASSed.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Merged build finished. Test PASSed.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65631/consoleFull)** for PR 13513 at commit [`84d3d27`](https://github.com/apache/spark/commit/84d3d27490556dc1de4e4bce3b6b19a75691f52e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15102: [SPARK-17346][SQL] Add Kafka source for Structured Strea...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15102 **[Test build #65636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65636/consoleFull)** for PR 15102 at commit [`f5c57f5`](https://github.com/apache/spark/commit/f5c57f51f675002298c833edb486451642735221).
[GitHub] spark issue #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Statements ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15054 **[Test build #65637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65637/consoleFull)** for PR 15054 at commit [`48ce44e`](https://github.com/apache/spark/commit/48ce44e20a3db290c4c563b4d45ec5bfb6a86195).
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user citoubest commented on the issue: https://github.com/apache/spark/pull/15135 @rxin @davies @srowen
[GitHub] spark issue #14808: [SPARK-17156][ML][EXAMPLE] Add multiclass logistic regre...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14808 https://github.com/apache/spark/pull/14834 is merged now. We did not implement a new API, but we can still update the logistic regression examples to show the new multiclass functionality.
[GitHub] spark issue #15147: [SPARK-17545] [SQL] Handle additional time offset format...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15147 I mean that the problem in the JIRA is not reproducible on the master branch, and therefore I believe we need another JIRA to describe supporting other time formats in the same way as the cast operation, as you suggested.
[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79539702 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -357,6 +346,21 @@ class SessionCatalog( tempTables.remove(formatTableName(name)) } + /** + * Retrieve the metadata of an existing temporary view. + * If the temporary view does not exist, return None. + */ + def getTempViewMetadataOption(name: String): Option[CatalogTable] = synchronized { --- End diff -- Yeah, we can combine them. Let me do it. Thanks!
[GitHub] spark pull request #14959: [SPARK-17387][PYSPARK] Creating SparkContext() fr...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/14959#discussion_r79539542 --- Diff: python/pyspark/java_gateway.py --- @@ -50,13 +50,18 @@ def launch_gateway(): # proper classpath and settings from spark-env.sh on_windows = platform.system() == "Windows" script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit" +command = [os.path.join(SPARK_HOME, script)] +if conf and conf.getAll(): +conf_items = [['--conf', '%s=%s' % (k, v)] for k, v in conf.getAll()] --- End diff -- Correct, will fix it.
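For context, the diff above builds a nested list of `['--conf', 'k=v']` pairs that then needs to be flattened into the spark-submit command line. A standalone sketch of that flattening step (illustrative values, not the actual `launch_gateway` code):

```python
import itertools

# Illustrative stand-in for conf.getAll(): a list of (key, value) pairs.
conf_all = [("spark.executor.memory", "2g"), ("spark.ui.enabled", "false")]

# One ['--conf', 'k=v'] pair per entry, then flattened into a single arg list.
conf_items = [["--conf", "%s=%s" % (k, v)] for k, v in conf_all]
command = ["./bin/spark-submit"] + list(itertools.chain.from_iterable(conf_items))
print(command)
# -> ['./bin/spark-submit', '--conf', 'spark.executor.memory=2g',
#     '--conf', 'spark.ui.enabled=false']
```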
[GitHub] spark issue #14452: [SPARK-16849][SQL][WIP] Improve subquery execution by de...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14452 @davies Thanks for the comment. In our initial benchmark of the TPC-DS queries using CTE (13 in total), this PR helps about half (6) of them, 5 queries are not affected, and 2 queries are regressed. I might say it would help CTE queries in most cases according to the results. I agree that a 500+ LOC change looks a bit big for this improvement. I would like to refactor and tailor part of the changes and break them down into small pieces for you to review; I hope I can reduce the LOC changes in the end. I would like to run Q64 with the push down disabled; may I ask the purpose of that?
[GitHub] spark issue #14852: [WIP][SPARK-17138][ML][MLib] Add Python API for multinom...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/14852 Now that https://github.com/apache/spark/pull/14834 has been merged, we can make the updates to the Python API. There is no new interface to implement, but it would be great if this PR could take care of updating the Python side to reflect that LOR supports multiclass now.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15148 A few high-level comments/questions: * Should this go into the `feature` package as a feature estimator/transformer? That is where other dimensionality reduction techniques have gone and I'm not sure we should create a new package for this. * Could you please point me to a specific section of a specific paper that documents the approaches used here? AFAICT, this patch implements something different than most of the approximate-nearest-neighbors-via-LSH algorithms found in papers. For instance, the method in section 2 [here](http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf) as well as the method on Wikipedia [here](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search) are different than the implementation in this pr. Also, the spark package [`spark-neighbors`](https://github.com/sethah/spark-neighbors) employs those approaches. I'm not an expert in LSH so I was just hoping for some clarification. * The implementation of the `RandomProjections` class actually follows the implementation of the "2-stable" (or more generically, "p-stable") LSH algorithm, and not the "Random Projection" algorithm in the paper that is referenced. At the very least, we should clarify this. Potentially, we should think of a better name. @karlhigley Would you mind taking a look at the patch, or providing your input on the comments?
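For reference, the "2-stable" hash family sethah mentions draws a Gaussian vector `a` and an offset `b` and computes `h(v) = floor((a·v + b) / w)`. A minimal plain-Python sketch of one such hash function (illustrative parameter choices, not the PR's implementation):

```python
import math
import random

def make_pstable_hash(dim, w, seed=0):
    """Return one hash function h(v) = floor((a.v + b) / w) from the
    2-stable (Gaussian) LSH family: a ~ N(0,1)^dim, b ~ U[0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + b) / w)

    return h

h = make_pstable_hash(dim=3, w=4.0, seed=42)
# Identical points always land in the same bucket; nearby points tend to.
assert h([1.0, 2.0, 3.0]) == h([1.0, 2.0, 3.0])
```

In practice several such functions are concatenated into a composite bucket key, and multiple independent tables are queried to trade precision against recall.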
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65627/ Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Merged build finished. Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65627/consoleFull)** for PR 14803 at commit [`541dfdc`](https://github.com/apache/spark/commit/541dfdc637b5373c384249a601d0a3e8486adb07). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13705: [SPARK-15472][SQL] Add support for writing in `csv` form...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13705 **[Test build #65635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65635/consoleFull)** for PR 13705 at commit [`9869f98`](https://github.com/apache/spark/commit/9869f9885e4fdc7364cd46ab05b1f332921ff8d7).
[GitHub] spark pull request #15134: [SPARK-17580][CORE]Add random UUID as app name wh...
Github user phalodi closed the pull request at: https://github.com/apache/spark/pull/15134
[GitHub] spark pull request #15133: [SPARK-17578][Docs] Add spark.app.name default va...
Github user phalodi closed the pull request at: https://github.com/apache/spark/pull/15133
[GitHub] spark pull request #15126: [SPARK-17513][SQL] Make StreamExecution garbage-c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15126
[GitHub] spark issue #15126: [SPARK-17513][SQL] Make StreamExecution garbage-collect ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15126 Merging in master/2.0.
[GitHub] spark issue #15067: [SPARK-17513] [STREAMING] [SQL] Make StreamExecution gar...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15067 @frreiss can you close this now?
[GitHub] spark issue #15126: [SPARK-17513][SQL] Make StreamExecution garbage-collect ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15126 Since @frreiss hasn't updated the pr yet, I'm going to merge this one and assign the jira ticket to Fred.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65629/ Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Merged build finished. Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14634 **[Test build #65629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65629/consoleFull)** for PR 14634 at commit [`90fbe4e`](https://github.com/apache/spark/commit/90fbe4e7bc8e80d7601eb020d428055a1a44797a). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAndAfter `
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15157 Merged build finished. Test PASSed.
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15157 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65624/ Test PASSed.
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15157 **[Test build #65624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65624/consoleFull)** for PR 15157 at commit [`5b73205`](https://github.com/apache/spark/commit/5b732058ac911b6cb52a8639281681c3ee9d9dae). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15158: [SPARK-17603] [SQL] Utilize Hive-generated Statistics Fo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15158 **[Test build #65634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65634/consoleFull)** for PR 15158 at commit [`061e60b`](https://github.com/apache/spark/commit/061e60b3af819f235e531b1de24f136a431dc23c).
[GitHub] spark issue #15147: [SPARK-17545] [SQL] Handle additional time offset format...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15147 @nbeyer Thanks for your investigation. I think that sounds reasonable, though it might be arguable, because adding more cases effectively means more time and computation to parse/infer the schema (specifically in the case of `TimestampType` in CSV), and there is already an option to specify the date format; however, it makes sense that users don't really want to specify any option when they just want to read time data. I'd follow the committers' lead. Anyway, this might not be related to this JIRA anymore. How about creating another one describing the current state and the suggestion?
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65625/ Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Merged build finished. Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65625/consoleFull)** for PR 14803 at commit [`5b101ab`](https://github.com/apache/spark/commit/5b101aba62efd34077495eb55159ec1b93d2c90e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15158: [SPARK-17603] [SQL] Utilize Hive-generated Statis...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/15158 [SPARK-17603] [SQL] Utilize Hive-generated Statistics For Partitioned Tables ### What changes were proposed in this pull request? For non-partitioned tables, Hive-generated statistics are stored in table properties. However, for partitioned tables, Hive-generated statistics are stored in partition properties, so we are currently unable to utilize them for partitioned tables. Also, statistics might not be gathered for all the partitions in Hive; when collection is only partial, we will not utilize the Hive-generated statistics. ### How was this patch tested? Added test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark partitionedTableStatistics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15158.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15158 commit 061e60b3af819f235e531b1de24f136a431dc23c Author: gatorsmile Date: 2016-09-20T04:49:51Z fix.
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/15090 This pr has been updated based on all the above comments; changes are as follows:
1. Modify the analyze syntax a little bit: `identifierSeq` is now non-optional, i.e. users must specify column names after `FOR COLUMNS`.
2. Check column correctness based on case sensitivity.
3. Deduplicate columns when checking correctness.
4. Support analyzing columns independently, i.e. when analyzing new columns, we no longer remove stats of columns which were analyzed before.
5. Rename `BasicColStats` to `ColumnStats`.
6. Use 3 × standard deviation to check the ndv result in test cases.
7. Code refactoring based on comments.
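As a hedged sketch of point 1 above, the resulting syntax with mandatory column names after `FOR COLUMNS` would be invoked roughly like this; the table and column names are hypothetical, and persisting the statistics assumes Hive support is enabled:

```scala
// Sketch of the ANALYZE syntax described above. Table and column
// names are hypothetical.
import org.apache.spark.sql.SparkSession

object AnalyzeColumnsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("analyze-columns-sketch")
      .enableHiveSupport()
      .getOrCreate()
    // Column names are now mandatory after FOR COLUMNS; omitting
    // them would be a parse error under the updated grammar.
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, quantity")
    spark.stop()
  }
}
```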
[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15090 **[Test build #65633 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65633/consoleFull)** for PR 15090 at commit [`0f974c0`](https://github.com/apache/spark/commit/0f974c019401ac5cef1be1b1e69a523ee2287101).
[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14834
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/14834 Merged into master. Thanks.
[GitHub] spark issue #15147: [SPARK-17545] [SQL] Handle additional time offset format...
Github user nbeyer commented on the issue: https://github.com/apache/spark/pull/15147 @HyukjinKwon Based on my further reading of the code, I'd like to suggest adding a deprecation to the stringToTime method and then updating the stringToTimestamp method, specifically here https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L356, to handle the "no colon" case. It is the stringToTimestamp method that is used by the 'cast'.
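A minimal sketch of the kind of normalization being suggested above — this is illustrative only, not the actual DateTimeUtils patch: rewrite a trailing "no colon" zone offset (e.g. `+0100`) into the colon form (e.g. `+01:00`) that a colon-only parser already accepts.

```scala
// Illustrative sketch, not the actual DateTimeUtils change: normalize
// a trailing "no colon" offset like "+0100" into "+01:00" before
// handing the string to a parser that only understands the latter.
def normalizeOffset(s: String): String = {
  // sign followed by exactly four digits at the end of the string
  val noColon = "^(.*)([+-])(\\d{2})(\\d{2})$".r
  s match {
    case noColon(prefix, sign, hh, mm) => s"$prefix$sign$hh:$mm"
    case _ => s // already has a colon, or no offset at all
  }
}
```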
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Merged build finished. Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14803 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65623/ Test PASSed.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65623/consoleFull)** for PR 14803 at commit [`23ba9a2`](https://github.com/apache/spark/commit/23ba9a23ab835987ed326a9320cf8632a0783885). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15150 Merged build finished. Test FAILed.
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15150 **[Test build #65632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65632/consoleFull)** for PR 15150 at commit [`f7311a2`](https://github.com/apache/spark/commit/f7311a22d78b1875446e86aa53ad9f15892df7e2). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15150 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65632/ Test FAILed.
[GitHub] spark issue #15082: [SPARK-17528][SQL] MutableProjection should not cache co...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15082 I re-targeted it to 2.1 only.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r79533403 --- Diff: mllib/src/main/scala/org/apache/spark/ml/lsh/LSH.scala --- @@ -0,0 +1,270 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.lsh + +import scala.util.Random + +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.sql._ +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for output dimension. 
+ * + * @group param + */ + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension", +ParamValidators.gt(0)) + + /** @group getParam */ + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1) + + setDefault(outputCol -> "lsh_output") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without outputCol + * @return A derived schema with outputCol added + */ + final def transformLSHSchema(schema: StructType): StructType = { +val outputFields = schema.fields :+ + StructField($(outputCol), new VectorUDT, nullable = false) +StructType(outputFields) + } +} + +/** + * Model produced by [[LSH]]. + */ +abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml] + extends Model[T] with LSHParams { + override def copy(extra: ParamMap): T = defaultCopy(extra) + /** + * :: DeveloperApi :: + * + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + protected[this] val hashFunction: KeyType => Vector + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One of the point in the metric space + * @param y Another the point in the metric space + * @return The distance between x and y in double + */ + protected[ml] def keyDistance(x: KeyType, y: KeyType): Double + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different hash Vectors. By default, the distance is the + * minimum distance of two hash values in any dimension. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y in double + */ + protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +(x.asBreeze - y.asBreeze).toArray.map(math.abs).min --- End diff -- For a pair of `DenseVector`, you can directly use its `values` member and do something like: x.values.zip(y.values).map(x => math.abs(x._1 - x._2)).min For a pair of `SparseVector`, you may not need to convert `(x.asBreeze - y.asBreeze)` back to an `Array`, because the resulting array should be sparse too. We can directly map on the Breeze vector, i.e., `(x.asBreeze - y.asBreeze).map(math.abs).min`.
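The dense-vector variant suggested in the review comment above can be sketched in plain Scala on raw arrays (the values here are made up for illustration):

```scala
// Sketch of the suggested DenseVector-style distance: pair up the two
// value arrays, take the absolute difference of each pair, and keep
// the minimum, i.e. the smallest per-dimension gap between the hashes.
val x = Array(1.0, 5.0, 9.0)
val y = Array(2.0, 7.0, 9.5)
val hashDist = x.zip(y).map { case (a, b) => math.abs(a - b) }.min
// differences are 1.0, 2.0 and 0.5, so hashDist is 0.5
```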
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15150 **[Test build #65632 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65632/consoleFull)** for PR 15150 at commit [`f7311a2`](https://github.com/apache/spark/commit/f7311a22d78b1875446e86aa53ad9f15892df7e2).
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79533281 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -293,6 +293,39 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be Option(dir).map(spark.read.format("org.apache.spark.sql.test").load) } + test("read a data source that does not extend SchemaRelationProvider") { +val dfReader = spark.read + .option("from", "1") + .option("TO", "10") + .format("org.apache.spark.sql.sources.SimpleScanSource") + +// when users do not specify the schema +checkAnswer(dfReader.load(), spark.range(1, 11).toDF()) + +// when users specify the schema +val inputSchema = new StructType().add("s", IntegerType, nullable = false) +val e = intercept[AnalysisException] { dfReader.schema(inputSchema).load() } --- End diff -- was there no test for this case before?
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79533310 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -327,8 +327,13 @@ case class DataSource( dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) case (_: SchemaRelationProvider, None) => throw new AnalysisException(s"A schema needs to be specified when using $className.") - case (_: RelationProvider, Some(_)) => -throw new AnalysisException(s"$className does not allow user-specified schemas.") + case (dataSource: RelationProvider, Some(schema)) => +val baseRelation = + dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) +if (baseRelation.schema != schema) { --- End diff -- cc @yhuai @liancheng to confirm, is it safe?
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79533043 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/TableScanSuite.scala --- @@ -345,34 +345,72 @@ class TableScanSuite extends DataSourceTest with SharedSQLContext { (1 to 10).map(Row(_)).toSeq) } + test("create a temp table that does not have a path in the option") { +Seq("TEMPORARY VIEW", "TABLE").foreach { tableType => + val tableName = "relationProvierWithSchema" + withTable(tableName) { +sql( + s""" + |CREATE $tableType $tableName --- End diff -- what does this test?
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79532870 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/TableScanSuite.scala --- @@ -345,34 +345,72 @@ class TableScanSuite extends DataSourceTest with SharedSQLContext { (1 to 10).map(Row(_)).toSeq) } + test("create a temp table that does not have a path in the option") { --- End diff -- `temp view`
[GitHub] spark pull request #15046: [SPARK-17492] [SQL] Fix Reading Cataloged Data So...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15046#discussion_r79532807 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala --- @@ -65,6 +65,26 @@ class InsertSuite extends DataSourceTest with SharedSQLContext { ) } + test("insert into a temp view that does not point to an insertable data source") { +import testImplicits._ +withTempView("t1", "t2") { + sql( +""" + |CREATE TEMPORARY TABLE t1 --- End diff -- let's use CREATE TEMPORARY VIEW
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15148 @Yunni Thanks for working on this.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65631/consoleFull)** for PR 13513 at commit [`84d3d27`](https://github.com/apache/spark/commit/84d3d27490556dc1de4e4bce3b6b19a75691f52e).
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r79532298 --- Diff: mllib/src/main/scala/org/apache/spark/ml/lsh/LSH.scala --- @@ -0,0 +1,270 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.lsh + +import scala.util.Random + +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.sql._ +import org.apache.spark.sql.expressions.UserDefinedFunction +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for output dimension. 
+ * + * @group param + */ + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension", +ParamValidators.gt(0)) + + /** @group getParam */ + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1) + + setDefault(outputCol -> "lsh_output") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without outputCol + * @return A derived schema with outputCol added + */ + final def transformLSHSchema(schema: StructType): StructType = { +val outputFields = schema.fields :+ + StructField($(outputCol), new VectorUDT, nullable = false) +StructType(outputFields) + } +} + +/** + * Model produced by [[LSH]]. + */ +abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml] + extends Model[T] with LSHParams { + override def copy(extra: ParamMap): T = defaultCopy(extra) + /** + * :: DeveloperApi :: + * + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + protected[this] val hashFunction: KeyType => Vector + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One of the point in the metric space + * @param y Another the point in the metric space + * @return The distance between x and y in double + */ + protected[ml] def keyDistance(x: KeyType, y: KeyType): Double + + /** + * :: DeveloperApi :: + * + * Calculate the distance between two different hash Vectors. By default, the distance is the + * minimum distance of two hash values in any dimension. + * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y in double + */ + protected[ml] def hashDistance(x: Vector, y: Vector): Double = { +(x.asBreeze - y.asBreeze).toArray.map(math.abs).min + } + + /** + * Transforms the input dataset. 
+ */ + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + /** + * :: DeveloperApi :: + * + * Check transform validity and derive the output schema from the input schema. + * + * Typical implementation should first conduct verification on schema change and parameter + * validity, including complex parameter interaction checks. + */ + override def transformSchema(schema: StructType): StructType = { +transformLSHSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. +
[GitHub] spark issue #15155: [SPARK-17477][SQL] SparkSQL cannot handle schema evoluti...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15155 Yea. I meant that if we want to read old/new Parquet files without a user-given schema and with schema merging enabled, then we'd face SPARK-15516 first. This is why I thought that JIRA blocks this case.
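For context, the scenario described above (old and new Parquet files, no user-given schema, schema merging enabled) is the read path sketched below; the directory names are hypothetical:

```scala
// Sketch: reading a mix of old and new Parquet files with schema
// merging enabled and no user-given schema -- the scenario discussed
// above. Directory names are hypothetical.
import org.apache.spark.sql.SparkSession

object MergeSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-schema-sketch")
      .getOrCreate()
    val df = spark.read
      .option("mergeSchema", "true") // ask Parquet to reconcile the file schemas
      .parquet("data/old", "data/new")
    df.printSchema() // the merged schema covers columns from both sets of files
    spark.stop()
  }
}
```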
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14784 Merged build finished. Test PASSed.
[GitHub] spark issue #15146: [SPARK-17590][SQL] Analyze CTE definitions at once and a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15146 **[Test build #65630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65630/consoleFull)** for PR 15146 at commit [`baf239b`](https://github.com/apache/spark/commit/baf239b69a1e82ef37845857f295fe4df1780b46).
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14784 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65626/ Test PASSed.
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14784 **[Test build #65626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65626/consoleFull)** for PR 14784 at commit [`c91d02a`](https://github.com/apache/spark/commit/c91d02a95d8239db5d2d4db7a796a987705a449d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14634 **[Test build #65629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65629/consoleFull)** for PR 14634 at commit [`90fbe4e`](https://github.com/apache/spark/commit/90fbe4e7bc8e80d7601eb020d428055a1a44797a).
[GitHub] spark issue #15146: [SPARK-17590][SQL] Analyze CTE definitions at once and a...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15146 I guess that if using the same analyzed plan increases the chance to reuse an exchange, it may improve performance. Anyway, that is not the purpose of this change. Because the analyzed subquery plan will be changed substantially during later optimization, we cannot guarantee this.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65628/ Test FAILed.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65628/consoleFull)** for PR 13513 at commit [`bddbc7f`](https://github.com/apache/spark/commit/bddbc7f8e1563000ea4a9dcad07c92e34c24199f). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13513 Merged build finished. Test FAILed.
[GitHub] spark issue #15155: [SPARK-17477][SQL] SparkSQL cannot handle schema evoluti...
Github user wgtmac commented on the issue: https://github.com/apache/spark/pull/15155 @HyukjinKwon Yup, this PR is very similar to yours. Merging the Parquet schemas won't work, though. Think about this: the table contains two Parquet files, one with int and the other with long. The DataFrame schema uses long (mergeSchema will also produce this result). So when reading the Parquet file with int, we still run into this problem.
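The scenario above — a merged table schema of `long` while an older file physically stores `int` — can only be resolved by widening each value at read time, since `Int -> Long` is lossless. A toy sketch of that per-value upcast (this is an illustration of the idea, not Spark's Parquet reader; all names here are mine):

```scala
// Physical cell values as stored per file; the merged table schema says Long.
sealed trait Physical
final case class PhysInt(v: Int) extends Physical   // written by old files
final case class PhysLong(v: Long) extends Physical // written by new files

// Widening read: an Int cell is upcast to Long instead of failing the scan.
def readAsLong(cell: Physical): Long = cell match {
  case PhysInt(v)  => v.toLong // lossless widening
  case PhysLong(v) => v
}

// Cells drawn from a mix of old (int) and new (long) files.
val mixedFileCells: Seq[Physical] = Seq(PhysInt(1), PhysLong(2L), PhysInt(3))
val column: Seq[Long] = mixedFileCells.map(readAsLong)
```

The point of the sketch: the widening has to happen where the physical file type meets the logical table type, which is exactly the layer the PR discussion is about.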
[GitHub] spark issue #15146: [SPARK-17590][SQL] Analyze CTE definitions at once and a...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15146 @hvanhovell This is an analyzer change that adds support for CTEs within CTE definitions. I don't expect a performance improvement.
[GitHub] spark issue #13513: [SPARK-15698][SQL][Streaming] Add the ability to remove ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13513 **[Test build #65628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65628/consoleFull)** for PR 13513 at commit [`bddbc7f`](https://github.com/apache/spark/commit/bddbc7f8e1563000ea4a9dcad07c92e34c24199f).
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14803 @marmbrus > * I think that for all but text you have to include the partition columns in the schema if inference is turned off (which it is by default). For the text format, when inference is turned off but there is a user-provided schema, we will use that schema. In this case, I think the user should also include the partition columns in the schema, right?
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65627 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65627/consoleFull)** for PR 14803 at commit [`541dfdc`](https://github.com/apache/spark/commit/541dfdc637b5373c384249a601d0a3e8486adb07).
[GitHub] spark pull request #15156: [SPARK-17160] Properly escape field names in code...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15156
[GitHub] spark issue #15156: [SPARK-17160] Properly escape field names in code-genera...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/15156 I'm going to merge this to master and branch-2.0. Thanks!
[GitHub] spark pull request #13513: [SPARK-15698][SQL][Streaming] Add the ability to ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/13513#discussion_r79530152

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala ---

@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming
+
+import java.util.{LinkedHashMap => JLinkedHashMap}
+import java.util.Map.Entry
+
+import scala.collection.mutable
+
+import org.json4s.NoTypeHints
+import org.json4s.jackson.Serialization
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.FileStreamSource.FileEntry
+import org.apache.spark.sql.internal.SQLConf
+
+class FileStreamSourceLog(
+    metadataLogVersion: String,
+    sparkSession: SparkSession,
+    path: String)
+  extends CompactibleFileStreamLog[FileEntry](metadataLogVersion, sparkSession, path) {
+
+  import CompactibleFileStreamLog._
+
+  // Configurations about metadata compaction
+  protected override val compactInterval =
+    sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL)
+  require(compactInterval > 0,
+    s"Please set ${SQLConf.FILE_SOURCE_LOG_COMPACT_INTERVAL.key} (was $compactInterval) to a " +
+      s"positive value.")
+
+  protected override val fileCleanupDelayMs =
+    sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_CLEANUP_DELAY)
+
+  protected override val isDeletingExpiredLog =
+    sparkSession.conf.get(SQLConf.FILE_SOURCE_LOG_DELETION)
+
+  private implicit val formats = Serialization.formats(NoTypeHints)
+
+  // A fixed size log entry cache to cache the file entries belong to the compaction batch. It is
+  // used to avoid scanning the compacted log file to retrieve its own batch data.
+  private val cacheSize = compactInterval
+  private val fileEntryCache = new JLinkedHashMap[Long, Array[FileEntry]] {
+    override def removeEldestEntry(eldest: Entry[Long, Array[FileEntry]]): Boolean = {
+      size() > cacheSize
+    }
+  }
+
+  protected override def serializeData(data: FileEntry): String = {
+    Serialization.write(data)
+  }
+
+  protected override def deserializeData(encodedString: String): FileEntry = {
+    Serialization.read[FileEntry](encodedString)
+  }
+
+  def compactLogs(logs: Seq[FileEntry]): Seq[FileEntry] = {
+    logs
+  }
+
+  override def add(batchId: Long, logs: Array[FileEntry]): Boolean = {
+    if (super.add(batchId, logs) && isCompactionBatch(batchId, compactInterval)) {

--- End diff --

yes, you're right, I will fix it.
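The fixed-size `fileEntryCache` above relies on `java.util.LinkedHashMap`'s `removeEldestEntry` hook: once `size()` exceeds the cap, the map drops the oldest insertion automatically. A minimal standalone sketch of the same pattern (generic names are mine, not from the PR):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap}
import java.util.Map.Entry

// Fixed-size insertion-order cache: LinkedHashMap calls removeEldestEntry
// after every put, and evicts the oldest entry whenever we return true.
class FixedSizeCache[K, V](cacheSize: Int) extends JLinkedHashMap[K, V] {
  override def removeEldestEntry(eldest: Entry[K, V]): Boolean =
    size() > cacheSize
}

val cache = new FixedSizeCache[Long, String](2)
cache.put(1L, "batch-1")
cache.put(2L, "batch-2")
cache.put(3L, "batch-3") // evicts key 1L, the eldest insertion
```

Because eviction happens inside `put`, the cache can never grow past its bound, which is why the PR can size it by `compactInterval` without any explicit cleanup code.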
[GitHub] spark issue #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-cluster...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/14639 Closing this as it has been resolved elsewhere.
[GitHub] spark pull request #14639: [SPARK-17054][SPARKR] SparkR can not run in yarn-...
Github user zjffdu closed the pull request at: https://github.com/apache/spark/pull/14639
[GitHub] spark issue #14784: [SPARK-17210][SPARKR] sparkr.zip is not distributed to e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14784 **[Test build #65626 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65626/consoleFull)** for PR 14784 at commit [`c91d02a`](https://github.com/apache/spark/commit/c91d02a95d8239db5d2d4db7a796a987705a449d).
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65625/consoleFull)** for PR 14803 at commit [`5b101ab`](https://github.com/apache/spark/commit/5b101aba62efd34077495eb55159ec1b93d2c90e).
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15157 **[Test build #65624 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65624/consoleFull)** for PR 15157 at commit [`5b73205`](https://github.com/apache/spark/commit/5b732058ac911b6cb52a8639281681c3ee9d9dae).
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14834 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65622/ Test PASSed.
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14834 Merged build finished. Test PASSed.
[GitHub] spark issue #14834: [SPARK-17163][ML] Unified LogisticRegression interface
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14834 **[Test build #65622 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65622/consoleFull)** for PR 14834 at commit [`4dae595`](https://github.com/apache/spark/commit/4dae59569732ace5cb2cf583d6db315fb3eda596). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15157: Revert "[SPARK-17549][SQL] Only collect table size stat ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15157 cc @vanzin
[GitHub] spark pull request #15157: Revert "[SPARK-17549][SQL] Only collect table siz...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/15157 Revert "[SPARK-17549][SQL] Only collect table size stat in driver for cached relation." This reverts commit 39e2bad6a866d27c3ca594d15e574a1da3ee84cc because of the problem mentioned at https://issues.apache.org/jira/browse/SPARK-17549?focusedCommentId=15505060&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15505060

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark revert-SPARK-17549

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15157.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15157

commit 5b732058ac911b6cb52a8639281681c3ee9d9dae
Author: Yin Huai
Date: 2016-09-20T02:58:30Z

    Revert "[SPARK-17549][SQL] Only collect table size stat in driver for cached relation."

    This reverts commit 39e2bad6a866d27c3ca594d15e574a1da3ee84cc.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14803 > * If the partition directories are not present when the stream starts then I believe this breaks. Yes. Schema inference only happens when starting the stream. > * I think that for all but text you have to include the partition columns in the schema if inference is turned off (which it is by default). I will add this to the programming guide.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14634 This change looks good. Let's add a regression test.
[GitHub] spark pull request #15054: [SPARK-17502] [SQL] Fix Multiple Bugs in DDL Stat...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15054#discussion_r79528505

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---

@@ -357,6 +346,21 @@ class SessionCatalog(
     tempTables.remove(formatTableName(name))
   }

+  /**
+   * Retrieve the metadata of an existing temporary view.
+   * If the temporary view does not exist, return None.
+   */
+  def getTempViewMetadataOption(name: String): Option[CatalogTable] = synchronized {

--- End diff --

Seems we always use it with the pattern `getTempViewMetadataOption.getOrElse(getTableMetadata)`, maybe we should just rename it to `getTempViewOrPermanentTableMetadata`?
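The lookup pattern under discussion — try the temporary view first, fall back to the permanent table — is just `Option.getOrElse`. A simplified sketch with a hypothetical in-memory catalog (the class and its `String` metadata are mine, not Spark's `SessionCatalog`):

```scala
// Hypothetical, heavily simplified catalog: metadata is just a String here.
class TinyCatalog {
  private val tempViews = scala.collection.mutable.Map[String, String]()
  private val tables = scala.collection.mutable.Map[String, String]()

  def createTempView(name: String, meta: String): Unit = tempViews(name) = meta
  def createTable(name: String, meta: String): Unit = tables(name) = meta

  def getTempViewMetadataOption(name: String): Option[String] = tempViews.get(name)
  def getTableMetadata(name: String): String = tables(name)

  // The renamed method the review proposes: temp view wins, else permanent table.
  def getTempViewOrPermanentTableMetadata(name: String): String =
    getTempViewMetadataOption(name).getOrElse(getTableMetadata(name))
}

val cat = new TinyCatalog
cat.createTable("t", "permanent")
cat.createTempView("v", "temp")
```

Folding the two-step pattern into one method, as the reviewer suggests, keeps every caller from re-implementing the fallback order.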
[GitHub] spark pull request #14803: [SPARK-17153][SQL] Should read partition data whe...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14803#discussion_r79526950

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala ---

@@ -608,6 +608,34 @@ class FileStreamSourceSuite extends FileStreamSourceTest {
   // === other tests

+  test("read new files in partitioned table without globbing, should read partition data") {

--- End diff --

Added a test for it.
[GitHub] spark issue #14803: [SPARK-17153][SQL] Should read partition data when readi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14803 **[Test build #65623 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65623/consoleFull)** for PR 14803 at commit [`23ba9a2`](https://github.com/apache/spark/commit/23ba9a23ab835987ed326a9320cf8632a0783885).
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65620/ Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14634 Merged build finished. Test PASSed.
[GitHub] spark issue #14634: [SPARK-17051][SQL] we should use hadoopConf in InsertInt...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14634 **[Test build #65620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65620/consoleFull)** for PR 14634 at commit [`64268f3`](https://github.com/apache/spark/commit/64268f34191d9f5447a63f34e53c9663aac714e2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r79521372

```diff
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsColumnSuite.scala ---
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.sql.{Date, Timestamp}
+
+import org.apache.spark.sql.{AnalysisException, Row}
+import org.apache.spark.sql.catalyst.plans.logical.BasicColStats
+import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
+import org.apache.spark.sql.types._
+
+class StatisticsColumnSuite extends StatisticsTest {
+
+  test("parse analyze column commands") {
+    val table = "table"
+    assertAnalyzeCommand(
+      s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, value",
+      classOf[AnalyzeColumnCommand])
+
+    val noColumnError = intercept[AnalysisException] {
+      sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS")
+    }
+    assert(noColumnError.message == "Need to specify the columns to analyze. Usage: " +
+      "ANALYZE TABLE tbl COMPUTE STATISTICS FOR COLUMNS key, value")
+
+    withTable(table) {
+      sql(s"CREATE TABLE $table (key INT, value STRING)")
+      val invalidColError = intercept[AnalysisException] {
+        sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS k")
+      }
+      assert(invalidColError.message == s"Invalid column name: k")
+
+      val duplicateColError = intercept[AnalysisException] {
+        sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, value, key")
+      }
+      assert(duplicateColError.message == s"Duplicate column name: key")
+
+      withSQLConf("spark.sql.caseSensitive" -> "true") {
+        val invalidErr = intercept[AnalysisException] {
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS keY")
+        }
+        assert(invalidErr.message == s"Invalid column name: keY")
+      }
+
+      withSQLConf("spark.sql.caseSensitive" -> "false") {
+        val duplicateErr = intercept[AnalysisException] {
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, value, vaLue")
+        }
+        assert(duplicateErr.message == s"Duplicate column name: vaLue")
+      }
+    }
+  }
+
+  test("basic statistics for integral type columns") {
+    val rdd = sparkContext.parallelize(Seq("1", null, "2", "3", null)).map { i =>
+      if (i != null) Row(i.toByte, i.toShort, i.toInt, i.toLong) else Row(i, i, i, i)
--- End diff --
```

Cool, please add some salt to this when you fix (as I don't think mine is perfect anyway :)).
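For readers skimming the quoted test, the invalid/duplicate column checks it exercises can be sketched in plain Scala as below. This is an illustrative reimplementation, not Spark's actual `AnalyzeColumnCommand` logic; the `ColumnValidation` object, the `validateColumns` name, and its signature are all hypothetical.

```scala
// Hypothetical sketch of the column validation the quoted tests assert:
// unknown columns and (case-insensitively, when configured) duplicated
// columns are rejected with the messages the tests expect.
object ColumnValidation {
  def validateColumns(
      requested: Seq[String],
      tableCols: Seq[String],
      caseSensitive: Boolean): Either[String, Seq[String]] = {
    // Normalize names unless the resolver is case sensitive.
    def norm(s: String): String = if (caseSensitive) s else s.toLowerCase
    val known = tableCols.map(norm).toSet

    // First unknown column, if any.
    val invalid = requested.find(c => !known.contains(norm(c)))
    // Report the later occurrence of a duplicated name, matching the
    // "Duplicate column name: vaLue" style message in the tests above.
    val duplicate = requested.groupBy(norm).collectFirst {
      case (_, cs) if cs.size > 1 => cs.last
    }

    (invalid, duplicate) match {
      case (Some(c), _) => Left(s"Invalid column name: $c")
      case (_, Some(c)) => Left(s"Duplicate column name: $c")
      case _            => Right(requested)
    }
  }
}
```

Under these assumptions, `validateColumns(Seq("key", "value", "vaLue"), Seq("key", "value"), caseSensitive = false)` rejects `vaLue` as a duplicate, while with `caseSensitive = true` it would instead be rejected as an invalid column name.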
[GitHub] spark issue #15034: [SPARK-16240][ML] ML persistence backward compatibility ...
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/15034

@jkbradley, it looks like this is legitimately failing MiMa (not sure why it passed on the first run...):

```
[error]  * the type hierarchy of object org.apache.spark.ml.clustering.LDA is different in current version. Missing types {org.apache.spark.ml.util.DefaultParamsReadable}
[error]    filter with: ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.ml.clustering.LDA$")
```
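For context, MiMa's suggested filter is normally added to Spark's binary-compatibility exclusion list in `project/MimaExcludes.scala`. A minimal sketch of such an entry follows; the surrounding list structure is abbreviated and assumed, while the filter string itself is taken verbatim from the MiMa output quoted above.

```scala
// Sketch of a MiMa exclusion entry (config fragment, not standalone code).
// In Spark these filters live in a Seq inside project/MimaExcludes.scala.
import com.typesafe.tools.mima.core._

ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.ml.clustering.LDA$")
```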
[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15090#discussion_r79520564

```diff
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsColumnSuite.scala ---
@@ -0,0 +1,228 @@
[quoted diff omitted here -- it is identical to the diff quoted in HyukjinKwon's comment earlier in this digest]
--- End diff --
```

@HyukjinKwon Seems better. Let me change the code based on this. Thanks.