[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user qiaohaijun commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-60730251 I will try it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-60426160 Okay, I've merged this to master. Will file a PR shortly to fix the tests.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2241
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-60148210 @marmbrus Thanks for the comments. Given that we have to support hive-0.12, there are two approaches I can think of to address the issue. 1st: we can temporarily restrict the compatibility test to hive12 only in the pom, and as a follow-up find an elegant way to add the corresponding compatibility test for hive0.13. This approach also unblocks some other JIRAs and builds the foundation for further development of hive 0.13 feature support. 2nd: we can create a separate set of files for hive-0.13, e.g., compatibility suite, golden plan, golden answer, which may involve hundreds of files. In addition, we would need to change the current basic hive test code to adapt to different hive versions. I think this approach may be a little rushed, and it would also make the scope of this PR really big and hard to maintain. I prefer the first approach, with follow-up JIRAs to address the leftovers in a more granular way. Please let me know what you think. If you have other options, please also let me know.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59884283 @marmbrus in #2499, I regenerated the golden answers and changed some *.ql files because of 0.13 changes; the tests passed on my local machine. @zhzhan I don't follow you: why replace the query plan?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59887284 @scwf I mean changing some *.ql files, as you already did. The problem is that it needs another layer to take care of the compatibility test suite. I have not found a good way to do it. I will think again to see whether there is a simple way to make it work.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59888325 @scwf I am wondering how you handle the decimal support, since hive-0.13 has new semantics for this type.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59889646 @marmbrus FYI: I ran the compatibility test, and so far the major outstanding issues are 1st: decimal support, and 2nd: udf7 and udf_round, which can be fixed, but I am not 100% sure my fix is the right way. Most of the other failures are false positives and should be solved by regenerating the golden answers.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59962787 A few comments: I'm not talking only about getting the current tests to pass, but upgrading the test set to include the new files. Also, I hope to update the whitelist to include any new tests that are now passing. I'm not particularly concerned about matching every small detail (i.e. if we don't need to match the empty comment field on metadata based on version). What is important is that we can connect to both Hive 12/13 metastores and that queries run and return the correct answers.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59837979 @scwf I did some basic functionality testing with your thrift patch, and it looks OK to me. By the way, because the customized 0.13.1 package is not available now, I reverted the pom for testing.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59839556 Thanks, if you have any comment, let me know:)
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59879567 We can regenerate the golden answers for hive 0.13 as I did in my closed PR; how about that?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59880136 @scwf, which PR?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59880350 @scwf The golden answers differ between hive12 and hive13. We need an extra shim layer to handle that.
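One way to picture the shim layer mentioned in the comment above is a small lookup that maps the Hive version to its own golden-answer directory. This is only a sketch: `GoldenAnswerShim`, `Hive12Golden`, `Hive13Golden`, `GoldenAnswers` and the `golden-<version>` directory layout are hypothetical names chosen for illustration and do not come from the PR.

```scala
import java.io.File

// Hypothetical shim: these names are illustrative and do not come from the PR.
sealed trait GoldenAnswerShim {
  def hiveVersion: String
  // Assumed layout: each Hive version keeps its golden answers in its own directory.
  def goldenDir(base: File): File = new File(base, s"golden-$hiveVersion")
}

case object Hive12Golden extends GoldenAnswerShim { val hiveVersion = "0.12.0" }
case object Hive13Golden extends GoldenAnswerShim { val hiveVersion = "0.13.1" }

object GoldenAnswers {
  // Pick the shim from the Hive version the build was compiled against.
  def shimFor(version: String): GoldenAnswerShim =
    if (version.startsWith("0.13")) Hive13Golden else Hive12Golden

  // Resolve the golden-answer file for one test under the version-specific directory.
  def answerFile(base: File, version: String, testName: String): File =
    new File(shimFor(version).goldenDir(base), testName)
}
```

With this shape the test harness itself would stay version-agnostic and only the directory lookup would change per Hive version; whether that is preferable to regenerating a single set of answers is the open question in this thread.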
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59880636 @marmbrus I think he refers to https://github.com/apache/spark/pull/2499
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59881210 @scwf Did you also replace the query plans for hive0.13 in your other PR? I also saw some query plan changes in hive0.13.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18944242

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -288,23 +290,26 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
           val cmd_trimmed: String = cmd.trim()
           val tokens: Array[String] = cmd_trimmed.split("\\s+")
           val cmd_1: String = cmd_trimmed.substring(tokens(0).length()).trim()
    -      val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
    +      val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hiveconf)
           proc match {
             case driver: Driver =>
    -          driver.init()
    -
    -          val results = new JArrayList[String]
    +          val results = HiveShim.createDriverResultsArray
               val response: CommandProcessorResponse = driver.run(cmd)
               // Throw an exception if there is an error in query processing.
               if (response.getResponseCode != 0) {
    -            driver.destroy()
    +            driver.close()
                 throw new QueryExecutionException(response.getErrorMessage)
               }
               driver.setMaxRows(maxRows)
               driver.getResults(results)
    -          driver.destroy()
    -          results
    +          driver.close()
    +          results.map { r =>
    +            r match {
    +              case s: String => s
    +              case o => o.toString
--- End diff --

Here `r` may be an `Array` type (https://github.com/scwf/hive/blob/branch-0.13/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchFormatter.java#L53-L64). We should cover that case; otherwise the console result will be printed as follows:

```
result
[Object@5e41108b
```

On the other hand, I suggest that we do some tests with this PR merged with #2685 to check the basic functionality.
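The case raised in the review comment above can be sketched as a pattern match with an explicit array branch. This is a minimal illustration, not the code that was merged: `DriverResults.rowToString`, the tab separator and the `NULL` placeholder are assumptions made for the example.

```scala
object DriverResults {
  // Render one fetched row as a string. A String row is passed through;
  // an array row (the case raised in the review) is joined column by column
  // instead of falling back to the default Object.toString of the array.
  // The tab separator and the "NULL" placeholder are assumptions.
  def rowToString(row: Any): String = row match {
    case s: String   => s
    case a: Array[_] => a.map(cell => if (cell == null) "NULL" else cell.toString).mkString("\t")
    case null        => "NULL"
    case other       => other.toString
  }
}
```

For example, `DriverResults.rowToString(Array[Any](1, "b", null))` joins the three cells with tabs rather than printing an object reference.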
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18969029

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
    @@ -288,23 +290,26 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
           val cmd_trimmed: String = cmd.trim()
           val tokens: Array[String] = cmd_trimmed.split("\\s+")
           val cmd_1: String = cmd_trimmed.substring(tokens(0).length()).trim()
    -      val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
    +      val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hiveconf)
           proc match {
             case driver: Driver =>
    -          driver.init()
    -
    -          val results = new JArrayList[String]
    +          val results = HiveShim.createDriverResultsArray
               val response: CommandProcessorResponse = driver.run(cmd)
               // Throw an exception if there is an error in query processing.
               if (response.getResponseCode != 0) {
    -            driver.destroy()
    +            driver.close()
                 throw new QueryExecutionException(response.getErrorMessage)
               }
               driver.setMaxRows(maxRows)
               driver.getResults(results)
    -          driver.destroy()
    -          results
    +          driver.close()
    +          results.map { r =>
    +            r match {
    +              case s: String => s
    +              case o => o.toString
--- End diff --

@scwf Thanks for the review. The reason I did this is that in the hive testing code I did not find a case in which the result value is not Array[String], and there the results are even initialized as Array[String]. For safety, I will change the code to process Array[Array[Object]]. This patch is independent of the thrift server, but the thrift server patch should be verified after this one goes upstream, mainly due to the pom file change.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58994264 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21718/consoleFull) for PR 2241 at commit [`cbb4691`](https://github.com/apache/spark/commit/cbb46911f881a0704fa16a36f2a362a930db6ade). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58995591 @marmbrus from a build perspective this LGTM with the caveat that right now it's only passing Hive compatibility for 0.12 tests and may require further modification to actually pass 0.13 tests. Up to you in terms of whether that blocks merging.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59001081 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21718/
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59001076 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21718/consoleFull) for PR 2241 at commit [`cbb4691`](https://github.com/apache/spark/commit/cbb46911f881a0704fa16a36f2a362a930db6ade).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)`
  * `class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)`
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-59145387 @marmbrus I tested the code quite exhaustively, in both unit tests and system tests, in a sandbox and on a real cluster, and didn't see major issues. Regarding the compatibility test, there are several test case failures due to hive 0.13 internal behavior changes, e.g., hive decimal. We can fix them in a follow-up. From my point of view, it would be good to take an incremental approach. The current patch has no impact on the existing hive 12 support, but it enables the community to actively improve hive0.13 support. Some immediate benefits: 1st: native parquet support, 2nd: some new UDFs in hive 0.13, and 3rd: better support for ORC as a source, e.g., compression, predicate pushdown, etc. Please let me know if you have any other concerns.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58849802 @zhzhan @scwf - I think this should be okay now for protobuf. We made some other changes this week updating the protobuf version to be based on protobuf 2.5 instead of 2.4 in akka. So now throughout Spark we use this. Mind rebasing this? I think the protobuf issue will go away.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58850172 Ok, thanks for that, I will also test it in https://github.com/apache/spark/pull/2685
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58853725 Hi @zhzhan and @scwf - I made some changes to the build to simplify it a bit. I made a PR into your branch. I tested it locally compiling for 0.12 and 0.13, but it would be good if you tested it as well to make sure it works. https://github.com/zhzhan/spark/pull/1/files
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58854626 Note @scwf there are some TODO's in there that need to be addressed in your patch for JDBC.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58927724 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18782903

--- Diff: docs/_config.yml ---
    @@ -1,7 +1,5 @@
     highlighter: pygments
     markdown: kramdown
    -gems:
    -  - jekyll-redirect-from
--- End diff --

This particular change we should revert - I accidentally added this from my branch
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58928624 @pwendell Thanks a lot for the help. But I have several concerns. 1. I am not sure how the new package is created. But hive-0.13.1 has native parquet support, and I remember I hit a binary compatibility issue when <artifactId>parquet-hive-bundle</artifactId> is included. So I excluded this one from hive-0.13.1. 2. Hive-0.13.1 is not fully compatible with hive-0.12.0, so hive-0.13.1 does not pass the whole compatibility test suite yet.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58929975 @zhzhan could you give a reproduction of the compatibility issue around parquet support? It didn't occur at compile time but maybe there is an issue at runtime.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58932508 @pwendell The following is the trace from ParquetMetastoreSuite; it is caused by the parquet-hive-bundle.

    sbt/sbt -Phive-0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test-only org.apache.spark.sql.parquet.ParquetMetastoreSuite

    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    11:11:50.877 ERROR hive.ql.exec.DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
        at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:66)
        at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.<init>(ArrayWritableObjectInspector.java:59)
        at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
        at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
        at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
        at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
        at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
        at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
        at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:297)
        at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:273)
        at org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:93)
        at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
        at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
        at org.apache.spark.sql.execution.Command$class.execute(commands.scala:44)
        at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:361)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:361)
        at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
        at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:104)
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58934179

I see - so what exactly is the problem - is the issue that the Parquet serde is not compatible across versions?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58935643

I think it is because the bundle contains the package org.apache.hadoop.hive.ql.io.parquet, which is now also natively located in Hive, and this results in package conflicts. The way I fixed it is to remove the parquet-hive-bundle dependency for hive-0.13.1. With this, I didn't see any problem with Parquet anymore. Please comment.
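
A package conflict of the kind described above can often be confirmed by asking the class loader where a class is loaded from. The following is a minimal sketch, not part of the patch: the object name `DuplicateClassCheck` and its helper `locationsOf` are made up for illustration, and only the standard `ClassLoader.getResources` call is relied on. If the same class file shows up in more than one jar, the jar that wins depends on classpath order.

```scala
import scala.collection.JavaConverters._

// Hypothetical helper (not from the PR): lists every classpath location
// that provides a given class.
object DuplicateClassCheck {
  def locationsOf(className: String): Seq[String] = {
    // Class files are addressed as resources, e.g. a/b/C.class
    val resource = className.replace('.', '/') + ".class"
    val loader = Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(ClassLoader.getSystemClassLoader)
    loader.getResources(resource).asScala.map(_.toString).toList
  }

  def main(args: Array[String]): Unit = {
    // Class name taken from the stack trace quoted in this thread.
    val cls = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    val found = locationsOf(cls)
    if (found.size > 1) {
      println(s"Conflict: $cls is provided by ${found.size} locations")
    }
    found.foreach(println)
  }
}
```

Run with the same classpath as the failing test; an empty result simply means the class is not visible to that class loader.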
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18786749

--- Diff: assembly/pom.xml ---
@@ -205,6 +211,21 @@
       </dependencies>
     </profile>
     <profile>
+      <id>hive-versions</id>
--- End diff --

I think this can be removed - right?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18786888

--- Diff: assembly/pom.xml ---
@@ -205,6 +211,21 @@
       </dependencies>
     </profile>
     <profile>
+      <id>hive-versions</id>
--- End diff --

We already require the `-Phive` profile to add hive to the assembly, so I think that is sufficient.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58937829

Okay then how about this to keep it simple:

1. Add the `parquet-hive-bundle` dependency inside of `hive/pom.xml` inside of a block for the `hive-0.12.0` profile only.
2. Change the default values of `hive.version`, `hive.short.version`, etc. to 0.13.1
3. Update the instructions as follows:
```
# Apache Hadoop 2.4.X with Hive 13 support
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package

# Apache Hadoop 2.4.X with Hive 12 support
mvn -Pyarn -Phive-0.12.0 -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
```

The issue is that we can't rely on dynamic profile activation because it messes up people linking against the `spark-hive` artifact. Many build tools like SBT do not support loading profiles in poms from other projects.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58938511

@pwendell Sounds good to me. Will update the patch.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58938955

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amp.lab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21692/ Test FAILed.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58943420

Thanks! Could you elaborate more on the second point you were making? About the compatibility tests? Is Hive 0.13 not language-compatible with Hive 0.12?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58956296

@pwendell With this new patch, we actually have to change the test script to pass all Hive-related test suites. I need help from somebody familiar with the test infra to add -Phive-0.12.0. (Is the change to dev/run-tests enough?)

Regarding the compatibility test, the query plan in Hive 0.13 is different from Hive 0.12. There are also some new features and internal changes, for example decimal support, GenericUDFRound, etc. But their scope is much smaller than this one, and we can fix them in a follow-up.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58979166

@zhzhan yeah you can just change `dev/run-tests` to build with the hive 12 profile for now where it currently has `-Phive`.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18805288

--- Diff: sql/hive/pom.xml ---
@@ -155,6 +162,25 @@
         <artifactId>scalatest-maven-plugin</artifactId>
       </plugin>
+      <plugin>
--- End diff --

The indentation seems off here. It might be my fault from my earlier patch.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58993065

Test this please
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-58993918

Hmmm Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18496664

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -287,22 +289,20 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
       val cmd_trimmed: String = cmd.trim()
       val tokens: Array[String] = cmd_trimmed.split("\\s+")
       val cmd_1: String = cmd_trimmed.substring(tokens(0).length()).trim()
-      val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
+      val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hiveconf)
       proc match {
         case driver: Driver =>
-          driver.init()
-          val results = new JArrayList[String]
--- End diff --

We need to consider compatibility here, since the ```driver.getResults(results)``` API changed in hive-0.13.1. You can refer to my change here (https://github.com/scwf/spark/blob/hive-0.13.1-clean/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala#L295-L310)
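
One way to absorb a signature change like the one mentioned above is to hide the call behind a per-version shim, in the spirit of the `HiveShim` objects in this PR. The sketch below is only an illustration under stated assumptions: `DriverA`, `DriverB`, `ShimA` and `ShimB` are hypothetical stand-ins, not Hive or Spark classes, and the two `getResults` signatures are invented to show the pattern rather than copied from Hive.

```scala
import java.util.{ArrayList => JArrayList, List => JList}
import scala.collection.JavaConverters._

// Hypothetical stand-ins for two driver versions; these are NOT Hive classes.
class DriverA {
  // "Old" style: fills a typed list.
  def getResults(out: JArrayList[String]): Boolean = { out.add("a\t1"); true }
}
class DriverB {
  // "New" style: fills a list of plain objects.
  def getResults(out: JList[Object]): Boolean = { out.add("a\t1"); true }
}

// One shim per version. In a real build only one of them would be compiled
// (for example from a version-specific source directory), and both would
// share one name so that callers never change.
object ShimA {
  def fetchRows(driver: DriverA): Seq[String] = {
    val buf = new JArrayList[String]
    driver.getResults(buf)
    buf.asScala.toList
  }
}
object ShimB {
  def fetchRows(driver: DriverB): Seq[String] = {
    val buf = new JArrayList[Object]
    driver.getResults(buf)
    // Normalize to strings so both shims return the same type.
    buf.asScala.map(_.toString).toList
  }
}

object ShimDemo {
  def main(args: Array[String]): Unit = {
    println(ShimA.fetchRows(new DriverA))
    println(ShimB.fetchRows(new DriverB))
  }
}
```

The design choice is that the caller only ever sees `fetchRows` returning `Seq[String]`; the version-specific list type stays inside the shim.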
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18433803

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -80,8 +81,10 @@ class StatisticsSuite extends QueryTest with BeforeAndAfterAll {
     sql("INSERT INTO TABLE analyzeTable SELECT * FROM src").collect()
     sql("INSERT INTO TABLE analyzeTable SELECT * FROM src").collect()
 
-    assert(queryTotalSize("analyzeTable") === defaultSizeInBytes)
-
+    // TODO: How it works? needs to add it back for other hive version.
+    if (HiveShim.version =="0.12.0") {
--- End diff --

I don't quite understand this part, but in my testing the new value is always 11624.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18433810

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.util.Properties
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.common.StatsSetupConst
+import org.apache.hadoop.hive.common.`type`.{HiveDecimal}
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.ql.Context
+import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition}
+import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
+import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory
+import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer}
+import org.apache.hadoop.mapred.InputFormat
+import org.apache.spark.Logging
+import org.apache.hadoop.{io => hadoopIo}
+import scala.collection.JavaConversions._
+import scala.language.implicitConversions
+
+/**
+ * A compatibility layer for interacting with Hive version 0.13.1.
+ */
+private[hive] object HiveShim {
+  val version = "0.13.1"
+  /*
+   * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0)
--- End diff --

Change the comments, and make the parser accept any decimal(precision, scale). But the gap is still there between hive0.13.1 and spark support.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18433827

--- Diff: sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.net.URI
+import java.util.Properties
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.common.`type`.HiveDecimal
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.ql.Context
+import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}
+import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
+import org.apache.hadoop.hive.ql.processors._
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
+import org.apache.hadoop.hive.serde2.{Deserializer, ColumnProjectionUtils}
+import org.apache.hadoop.{io => hadoopIo}
+import org.apache.hadoop.mapred.InputFormat
+import scala.collection.JavaConversions._
+import scala.language.implicitConversions
+
+/**
+ * A compatibility layer for interacting with Hive version 0.12.0.
+ */
+private[hive] object HiveShim {
+  val version = "0.12.0"
+  val metastoreDecimal = "decimal"
+
+  def getTableDesc(
+    serdeClass: Class[_ <: Deserializer],
+    inputFormatClass: Class[_ <: InputFormat[_, _]],
+    outputFormatClass: Class[_],
+    properties: Properties) = {
+    new TableDesc(serdeClass, inputFormatClass, outputFormatClass, properties)
--- End diff --

We need to return TableDesc to MetastoreRelation.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18433838

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.util.Properties
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.common.StatsSetupConst
+import org.apache.hadoop.hive.common.`type`.{HiveDecimal}
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.ql.Context
+import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition}
+import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
+import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory
+import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer}
+import org.apache.hadoop.mapred.InputFormat
+import org.apache.spark.Logging
+import org.apache.hadoop.{io => hadoopIo}
+import scala.collection.JavaConversions._
+import scala.language.implicitConversions
+
+/**
+ * A compatibility layer for interacting with Hive version 0.13.1.
+ */
+private[hive] object HiveShim {
+  val version = "0.13.1"
+  /*
+   * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0)
+   * Full support of new decimal feature need to be fixed in seperate PR.
+   */
+  val metastoreDecimal = "decimal(10,0)"
+
+  def getTableDesc(
+    serdeClass: Class[_ <: Deserializer],
+    inputFormatClass: Class[_ <: InputFormat[_, _]],
+    outputFormatClass: Class[_],
+    properties: Properties) = {
+    new TableDesc(inputFormatClass, outputFormatClass, properties)
+  }
+
+  def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE
+
+  def createDefaultDBIfNeeded(context: HiveContext) ={
+    context.runSqlHive("CREATE DATABASE default")
+    context.runSqlHive("USE default")
+  }
+
+  /** The string used to denote an empty comments field in the schema. */
+  def getEmptyCommentsFieldValue = ""
+
+  def getCommandProcessor(cmd: Array[String], conf: HiveConf) = {
+    CommandProcessorFactory.get(cmd, conf)
+  }
+
+  def createDecimal(bd: java.math.BigDecimal): HiveDecimal = {
+    HiveDecimal.create(bd)
+  }
+
+  /*
+   * This function in hive-0.13 become private, but we have to do this to walkaround hive bug
+   */
+  private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
+    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
+    val result: StringBuilder = new StringBuilder(old)
+    var first: Boolean = old.isEmpty
+
+    for (col <- cols) {
+      if (first) {
+        first = false
+      }
+      else {
+        result.append(',')
+      }
+      result.append(col)
+    }
+    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+  }
+
+  /*
+   * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty
+   */
+  def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) {
+    if (ids != null && ids.size > 0) {
+      ColumnProjectionUtils.appendReadColumns(conf, ids)
+    }
+    appendReadColumnNames(conf, names)
+  }
+
+  def getExternalTmpPath(context: Context, path: Path) = {
+    context.getExternalTmpPath(path.toUri)
+  }
+
+  def getDataLocationPath(p: Partition) = p.getDataLocation
+
+  def getAllPartitionsOf(client: Hive, tbl: Table) = client.getAllPartitionsOf(tbl)
+
+  /*
+   * Bug introdiced in hive-0.13. FileSinkDesc is serilizable, but its member path is not.
+   * Fix it through wrapper.
+   * */
+  implicit def wrapperToFileSinkDesc(w: ShimFileSinkDesc): FileSinkDesc = {
--- End diff --

I
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18433848

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala ---
@@ -0,0 +1,158 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.util.Properties
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.common.StatsSetupConst
+import org.apache.hadoop.hive.common.`type`.{HiveDecimal}
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.ql.Context
+import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition}
+import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
+import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory
+import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer}
+import org.apache.hadoop.mapred.InputFormat
+import org.apache.spark.Logging
+import org.apache.hadoop.{io => hadoopIo}
+import scala.collection.JavaConversions._
+import scala.language.implicitConversions
+
+/**
+ * A compatibility layer for interacting with Hive version 0.13.1.
+ */
+private[hive] object HiveShim {
+  val version = "0.13.1"
+  /*
+   * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0)
+   * Full support of new decimal feature need to be fixed in seperate PR.
+   */
+  val metastoreDecimal = "decimal(10,0)"
+
+  def getTableDesc(
+    serdeClass: Class[_ <: Deserializer],
+    inputFormatClass: Class[_ <: InputFormat[_, _]],
+    outputFormatClass: Class[_],
+    properties: Properties) = {
+    new TableDesc(inputFormatClass, outputFormatClass, properties)
+  }
+
+  def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE
+
+  def createDefaultDBIfNeeded(context: HiveContext) ={
+    context.runSqlHive("CREATE DATABASE default")
+    context.runSqlHive("USE default")
+  }
+
+  /** The string used to denote an empty comments field in the schema. */
+  def getEmptyCommentsFieldValue = ""
+
+  def getCommandProcessor(cmd: Array[String], conf: HiveConf) = {
+    CommandProcessorFactory.get(cmd, conf)
+  }
+
+  def createDecimal(bd: java.math.BigDecimal): HiveDecimal = {
+    HiveDecimal.create(bd)
+  }
+
+  /*
+   * This function in hive-0.13 become private, but we have to do this to walkaround hive bug
+   */
+  private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
+    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
+    val result: StringBuilder = new StringBuilder(old)
+    var first: Boolean = old.isEmpty
+
+    for (col <- cols) {
+      if (first) {
+        first = false
+      }
+      else {
+        result.append(',')
+      }
+      result.append(col)
+    }
+    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+  }
+
+  /*
+   * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty
+   */
+  def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) {
+    if (ids != null && ids.size > 0) {
+      ColumnProjectionUtils.appendReadColumns(conf, ids)
+    }
+    appendReadColumnNames(conf, names)
+  }
+
+  def getExternalTmpPath(context: Context, path: Path) = {
+    context.getExternalTmpPath(path.toUri)
+  }
+
+  def getDataLocationPath(p: Partition) = p.getDataLocation
+
+  def getAllPartitionsOf(client: Hive, tbl: Table) = client.getAllPartitionsOf(tbl)
+
+  /*
+   * Bug introdiced in hive-0.13. FileSinkDesc is serilizable, but its member path is not.
+   * Fix it through wrapper.
--- End diff --

Not sure how hive handle this issue, but FileSinkDesc is certainly not serializable with Path
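
The "fix it through a wrapper" idea in the quoted comment can be illustrated without Hive on the classpath. This is a minimal sketch under stated assumptions: `RawPath`, `BrokenDesc` and `DescWrapper` are hypothetical stand-ins (not the PR's `ShimFileSinkDesc` or Hive's `FileSinkDesc`), chosen only to show why a `Serializable` class with a non-serializable member fails at runtime and how a wrapper that carries the path as a `String` avoids that.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a path type that is not Serializable.
class RawPath(val value: String)

// Hypothetical descriptor: declared Serializable, but its `dir` member is
// not, so Java serialization throws NotSerializableException at runtime.
class BrokenDesc(val dir: RawPath, val compressed: Boolean)
  extends java.io.Serializable

// Wrapper holding only serializable fields; the real descriptor is rebuilt
// on demand after the wrapper has been shipped.
class DescWrapper(val dir: String, val compressed: Boolean)
  extends java.io.Serializable {
  def toDesc: BrokenDesc = new BrokenDesc(new RawPath(dir), compressed)
}

object WrapperDemo {
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.toByteArray
  }

  def roundTrip[T <: AnyRef](obj: T): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(serialize(obj)))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new DescWrapper("/tmp/out", compressed = true))
    println(copy.toDesc.dir.value)
    // Serializing the descriptor directly fails because of RawPath.
    println(scala.util.Try(serialize(new BrokenDesc(new RawPath("/tmp/out"), true))).isFailure)
  }
}
```

An implicit conversion from the wrapper to the real descriptor, as in the quoted `wrapperToFileSinkDesc`, would let call sites keep passing the wrapper where the descriptor is expected.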
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18434383

--- Diff: pom.xml ---
@@ -1260,7 +1259,18 @@
         </dependency>
       </dependencies>
     </profile>
-
+    <profile>
+      <id>hive-default</id>
+      <activation>
+        <property>
+          <name>!hive.version</name>
+        </property>
+      </activation>
+      <properties>
+        <hive.version>0.12.0</hive.version>
+        <derby.version>10.4.2.0</derby.version>
+      </properties>
+    </profile>
     <profile>
       <id>hive</id>
--- End diff --

Do we need the hive profile? Since we only support hive-0.12.0 and hive-0.13.1, I think providing two profiles here is enough: profile hive-0.12 for version 0.12.0 and profile hive-0.13 for version 0.13.1.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57939048

I tested with @pwendell's shaded hive-0.13.1, and it also has this problem:

Exception in thread "main" java.lang.ClassNotFoundException: com.google.protobuf_spark.GeneratedMessage

Hi @pwendell, I think we need to regenerate OrcProto.java. Here is ```org.spark_project.protobuf.ExtensionRegistry``` in https://github.com/pwendell/hive/blob/0.13-shaded-protobuf/ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java#L9 It should be ```com.google.protobuf_spark.ExtensionRegistry registry```?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57939320

I made a shaded hive-0.13.1 version several days ago for testing (https://github.com/scwf/hive/tree/0.13.1-shaded). Hope it is useful :)
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18439490

--- Diff: pom.xml ---
@@ -1260,7 +1259,18 @@
         </dependency>
       </dependencies>
     </profile>
-
+    <profile>
+      <id>hive-default</id>
+      <activation>
+        <property>
+          <name>!hive.version</name>
+        </property>
+      </activation>
+      <properties>
+        <hive.version>0.12.0</hive.version>
+        <derby.version>10.4.2.0</derby.version>
+      </properties>
+    </profile>
     <profile>
       <id>hive</id>
--- End diff --

The original patch did it that way, but the new approach can support different Hive versions without further changes to the pom file.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18429655 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) + * Full support of new decimal feature need to be fixed in seperate PR. + */ + val metastoreDecimal = "decimal(10,0)" + + def getTableDesc( +serdeClass: Class[_ <: Deserializer], +inputFormatClass: Class[_ <: InputFormat[_, _]], +outputFormatClass: Class[_], +properties: Properties) = { +new TableDesc(inputFormatClass, outputFormatClass, properties) + } + + def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE + + def createDefaultDBIfNeeded(context: HiveContext) ={ +context.runSqlHive("CREATE DATABASE default") +context.runSqlHive("USE default") + } + + /** The string used to denote an empty comments field in the schema. */ + def getEmptyCommentsFieldValue = "" + + def getCommandProcessor(cmd: Array[String], conf: HiveConf) = { +CommandProcessorFactory.get(cmd, conf) + } + + def createDecimal(bd: java.math.BigDecimal): HiveDecimal = { +HiveDecimal.create(bd) + } + + /* + * This function in hive-0.13 become private, but we have to do this to walkaround hive bug + */ + private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { +val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") +val result: StringBuilder = new StringBuilder(old) +var first: Boolean = old.isEmpty + +for (col <- cols) { + if (first) { +first = false + } + else { +result.append(',') + } + result.append(col) +} +conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) + } + + /* + * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty + */ + def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) { +if (ids != null && ids.size > 0) { + ColumnProjectionUtils.appendReadColumns(conf, ids) +} +appendReadColumnNames(conf, names) + } + + def getExternalTmpPath(context: Context, path: Path) = { +context.getExternalTmpPath(path.toUri)
+ } + + def getDataLocationPath(p: Partition) = p.getDataLocation + + def getAllPartitionsOf(client: Hive, tbl: Table) = client.getAllPartitionsOf(tbl) + + /* + * Bug introdiced in hive-0.13. FileSinkDesc is serilizable, but its member path is not. + * Fix it through wrapper. --- End diff -- I am pretty confused about it. I think Hive needs to serialize FileSinkDesc when the query plan
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18429712 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) + * Full support of new decimal feature need to be fixed in seperate PR. + */ + val metastoreDecimal = "decimal(10,0)" + + def getTableDesc( +serdeClass: Class[_ <: Deserializer], +inputFormatClass: Class[_ <: InputFormat[_, _]], +outputFormatClass: Class[_], +properties: Properties) = { +new TableDesc(inputFormatClass, outputFormatClass, properties) + } + + def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE + + def createDefaultDBIfNeeded(context: HiveContext) ={ +context.runSqlHive("CREATE DATABASE default") +context.runSqlHive("USE default") + } + + /** The string used to denote an empty comments field in the schema. */ + def getEmptyCommentsFieldValue = "" + + def getCommandProcessor(cmd: Array[String], conf: HiveConf) = { +CommandProcessorFactory.get(cmd, conf) + } + + def createDecimal(bd: java.math.BigDecimal): HiveDecimal = { +HiveDecimal.create(bd) + } + + /* + * This function in hive-0.13 become private, but we have to do this to walkaround hive bug + */ + private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { +val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") +val result: StringBuilder = new StringBuilder(old) +var first: Boolean = old.isEmpty + +for (col <- cols) { + if (first) { +first = false + } + else { +result.append(',') + } + result.append(col) +} +conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) + } + + /* + * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty + */ + def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) { +if (ids != null && ids.size > 0) { + ColumnProjectionUtils.appendReadColumns(conf, ids) +} +appendReadColumnNames(conf, names) + } + + def getExternalTmpPath(context: Context, path: Path) = { +context.getExternalTmpPath(path.toUri)
+ } + + def getDataLocationPath(p: Partition) = p.getDataLocation + + def getAllPartitionsOf(client: Hive, tbl: Table) = client.getAllPartitionsOf(tbl) + + /* + * Bug introdiced in hive-0.13. FileSinkDesc is serilizable, but its member path is not. + * Fix it through wrapper. + * */ + implicit def wrapperToFileSinkDesc(w: ShimFileSinkDesc): FileSinkDesc = { +var f = new
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18429736 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) + * Full support of new decimal feature need to be fixed in seperate PR. + */ + val metastoreDecimal = "decimal(10,0)" + + def getTableDesc( +serdeClass: Class[_ <: Deserializer], +inputFormatClass: Class[_ <: InputFormat[_, _]], +outputFormatClass: Class[_], +properties: Properties) = { +new TableDesc(inputFormatClass, outputFormatClass, properties) + } + + def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE + + def createDefaultDBIfNeeded(context: HiveContext) ={ +context.runSqlHive("CREATE DATABASE default") +context.runSqlHive("USE default") + } + + /** The string used to denote an empty comments field in the schema. */ + def getEmptyCommentsFieldValue = "" + + def getCommandProcessor(cmd: Array[String], conf: HiveConf) = { +CommandProcessorFactory.get(cmd, conf) + } + + def createDecimal(bd: java.math.BigDecimal): HiveDecimal = { +HiveDecimal.create(bd) + } + + /* + * This function in hive-0.13 become private, but we have to do this to walkaround hive bug + */ + private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { +val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") +val result: StringBuilder = new StringBuilder(old) +var first: Boolean = old.isEmpty + +for (col <- cols) { + if (first) { +first = false + } + else { +result.append(',') + } + result.append(col) +} +conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) + } + + /* + * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty + */ + def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) { +if (ids != null && ids.size > 0) { + ColumnProjectionUtils.appendReadColumns(conf, ids) +} +appendReadColumnNames(conf, names) + } + + def getExternalTmpPath(context: Context, path: Path) = { +context.getExternalTmpPath(path.toUri)
+ } + + def getDataLocationPath(p: Partition) = p.getDataLocation + + def getAllPartitionsOf(client: Hive, tbl: Table) = client.getAllPartitionsOf(tbl) + + /* + * Bug introdiced in hive-0.13. FileSinkDesc is serilizable, but its member path is not. + * Fix it through wrapper. + * */ + implicit def wrapperToFileSinkDesc(w: ShimFileSinkDesc): FileSinkDesc = { --- End diff -- If we
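The wrapper idea described in the diff comment (a serializable stand-in for a descriptor whose path member cannot be serialized) might look roughly like the sketch below. It is self-contained and deliberately avoids the Hive classes: `PathLike`, `SinkDesc`, `ShimSinkDesc`, and `WrapperDemo` are all hypothetical stand-ins, and the real `ShimFileSinkDesc` in the PR may carry different fields.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.language.implicitConversions

// Stand-in for a path type that is NOT Serializable.
final class PathLike(val uri: String)

// Stand-in for the descriptor that holds the non-serializable member.
class SinkDesc(val dir: PathLike, val compressed: Boolean)

// Wrapper that keeps only serializable fields (the path as a String).
// Case classes are Serializable by default.
final case class ShimSinkDesc(dir: String, compressed: Boolean)

object ShimSinkDesc {
  // Rebuilds the real descriptor on demand, after deserialization.
  implicit def toSinkDesc(w: ShimSinkDesc): SinkDesc =
    new SinkDesc(new PathLike(w.dir), w.compressed)
}

object WrapperDemo {
  // Java-serializes a value and reads it back, as shipping it to a task would.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

The wrapper travels through serialization, and the implicit conversion turns it back into the descriptor only at the point where a `SinkDesc` is required.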
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18429781 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) + * Full support of new decimal feature need to be fixed in seperate PR. + */ + val metastoreDecimal = "decimal(10,0)" + + def getTableDesc( +serdeClass: Class[_ <: Deserializer], +inputFormatClass: Class[_ <: InputFormat[_, _]], +outputFormatClass: Class[_], +properties: Properties) = { +new TableDesc(inputFormatClass, outputFormatClass, properties) + } + + def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE + + def createDefaultDBIfNeeded(context: HiveContext) ={ +context.runSqlHive("CREATE DATABASE default") +context.runSqlHive("USE default") + } + + /** The string used to denote an empty comments field in the schema. */ + def getEmptyCommentsFieldValue = "" + + def getCommandProcessor(cmd: Array[String], conf: HiveConf) = { +CommandProcessorFactory.get(cmd, conf) + } + + def createDecimal(bd: java.math.BigDecimal): HiveDecimal = { +HiveDecimal.create(bd) + } + + /* + * This function in hive-0.13 become private, but we have to do this to walkaround hive bug + */ + private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { +val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") +val result: StringBuilder = new StringBuilder(old) +var first: Boolean = old.isEmpty + +for (col <- cols) { + if (first) { +first = false + } + else { --- End diff -- ``` if () { ... } else { ... } ```
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18429783 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) + * Full support of new decimal feature need to be fixed in seperate PR. + */ + val metastoreDecimal = "decimal(10,0)" + + def getTableDesc( +serdeClass: Class[_ <: Deserializer], +inputFormatClass: Class[_ <: InputFormat[_, _]], +outputFormatClass: Class[_], +properties: Properties) = { +new TableDesc(inputFormatClass, outputFormatClass, properties) + } + + def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE + + def createDefaultDBIfNeeded(context: HiveContext) ={ +context.runSqlHive("CREATE DATABASE default") +context.runSqlHive("USE default") + } + + /** The string used to denote an empty comments field in the schema. */ + def getEmptyCommentsFieldValue = "" + + def getCommandProcessor(cmd: Array[String], conf: HiveConf) = { +CommandProcessorFactory.get(cmd, conf) + } + + def createDecimal(bd: java.math.BigDecimal): HiveDecimal = { +HiveDecimal.create(bd) + } + + /* + * This function in hive-0.13 become private, but we have to do this to walkaround hive bug + */ + private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { +val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "") +val result: StringBuilder = new StringBuilder(old) +var first: Boolean = old.isEmpty + +for (col <- cols) { + if (first) { +first = false + } + else { +result.append(',') + } + result.append(col) +} +conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) + } + + /* + * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty + */ + def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) { +if (ids != null && ids.size > 0) { + ColumnProjectionUtils.appendReadColumns(conf, ids) +} +appendReadColumnNames(conf, names) --- End diff -- Why is there no null and empty check here?
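As a sketch of what the null and empty check asked about above could look like for the column names, the merging step can be pulled out into a pure function. `ReadColumnNames.append` is a hypothetical helper, not part of the PR; it works on plain strings so it runs without a Hadoop `Configuration`.

```scala
object ReadColumnNames {
  // Merges new column names into an existing comma-separated value.
  // A null or empty `old` contributes nothing, and a null or empty
  // `cols` leaves the existing value unchanged.
  def append(old: String, cols: Seq[String]): String = {
    val existing = Option(old).filter(_.nonEmpty).toSeq
    val added = Option(cols).getOrElse(Seq.empty)
    (existing ++ added).mkString(",")
  }
}
```

For non-null inputs this yields the same string as the `StringBuilder` loop in the diff; the only addition is that null arguments are tolerated.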
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430042 --- Diff: sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.net.URI +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.`type`.HiveDecimal +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors._ +import org.apache.hadoop.hive.ql.stats.StatsSetupConst +import org.apache.hadoop.hive.serde2.{Deserializer, ColumnProjectionUtils} +import org.apache.hadoop.{io = hadoopIo} +import org.apache.hadoop.mapred.InputFormat +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.12.0. 
+ */ +private[hive] object HiveShim { + val version = "0.12.0" + val metastoreDecimal = "decimal" + + def getTableDesc( +serdeClass: Class[_ <: Deserializer], +inputFormatClass: Class[_ <: InputFormat[_, _]], +outputFormatClass: Class[_], +properties: Properties) = { +new TableDesc(serdeClass, inputFormatClass, outputFormatClass, properties) --- End diff -- Is it necessary?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430235 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala --- @@ -369,6 +371,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) { * tests. */ protected val originalUdfs: JavaSet[String] = FunctionRegistry.getFunctionNames + HiveShim.createDefaultDBIfNeeded(this) --- End diff -- Can you add a comment here to explain why it is necessary?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430262 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala --- @@ -78,6 +79,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) { // For some hive test case which contain ${system:test.tmp.dir} System.setProperty("test.tmp.dir", testTempDir.getCanonicalPath) + CommandProcessorFactory.clean(hiveconf) --- End diff -- Since this is cleanup work, it seems better to place it after `System.clearProperty("spark.hostPort")`. Also, please add a comment about what this call does and why it is needed.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430322 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -80,8 +81,10 @@ class StatisticsSuite extends QueryTest with BeforeAndAfterAll { sql("INSERT INTO TABLE analyzeTable SELECT * FROM src").collect() sql("INSERT INTO TABLE analyzeTable SELECT * FROM src").collect() -assert(queryTotalSize("analyzeTable") === defaultSizeInBytes) - +// TODO: How it works? needs to add it back for other hive version. +if (HiveShim.version =="0.12.0") { --- End diff -- For Hive 0.13, will the table always be updated after `INSERT INTO`? When we added this test, the table size was not updated by the `INSERT INTO` command.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430463 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) --- End diff -- Can you double-check this? I am not sure DECIMAL in hive-0.12 is actually DECIMAL(10,0). From the code, it seems the precision is unbounded.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430480 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -212,7 +214,18 @@ private[hive] object HiveQl { /** * Returns the AST for the given SQL string. */ - def getAst(sql: String): ASTNode = ParseUtils.findRootNonNullToken((new ParseDriver).parse(sql)) + def getAst(sql: String): ASTNode = { +/* + * Context has to be passed in in hive0.13.1. --- End diff -- in in
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430557 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) + * Full support of new decimal feature need to be fixed in seperate PR. + */ + val metastoreDecimal = "decimal(10,0)" --- End diff -- Let's say we connect to an existing Hive 0.13 metastore. If there is a decimal column with a user-defined precision and scale, will we see a parsing error?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430590 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala --- @@ -557,11 +557,14 @@ class HiveQuerySuite extends HiveComparisonTest { |WITH serdeproperties('s1'='9') .stripMargin) } -sql(sADD JAR $testJar) -sql( - ALTER TABLE alter1 SET SERDE 'org.apache.hadoop.hive.serde2.TestSerDe' -|WITH serdeproperties('s1'='9') - .stripMargin) +// now only verify 0.12.0, and ignore other versions due to binary compatability --- End diff -- Can you explain it a little bit more?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430600 --- Diff: sql/hive/v0.12.0/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.net.URI +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.`type`.HiveDecimal +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors._ +import org.apache.hadoop.hive.ql.stats.StatsSetupConst +import org.apache.hadoop.hive.serde2.{Deserializer, ColumnProjectionUtils} +import org.apache.hadoop.{io = hadoopIo} +import org.apache.hadoop.mapred.InputFormat +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.12.0. 
+ */ +private[hive] object HiveShim { + val version = "0.12.0" + val metastoreDecimal = "decimal" + + def getTableDesc( +serdeClass: Class[_ <: Deserializer], +inputFormatClass: Class[_ <: InputFormat[_, _]], +outputFormatClass: Class[_], +properties: Properties) = { +new TableDesc(serdeClass, inputFormatClass, outputFormatClass, properties) + } + + def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE + + def createDefaultDBIfNeeded(context: HiveContext) ={ } + + /** The string used to denote an empty comments field in the schema. */ + def getEmptyCommentsFieldValue = "None" + + def getCommandProcessor(cmd: Array[String], conf: HiveConf) = { +CommandProcessorFactory.get(cmd(0), conf) + } + + def createDecimal(bd: java.math.BigDecimal): HiveDecimal = { +new HiveDecimal(bd) + } + + def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) { +ColumnProjectionUtils.appendReadColumnIDs(conf, ids) +ColumnProjectionUtils.appendReadColumnNames(conf, names) + } + + def getExternalTmpPath(context: Context, uri: URI): String = { --- End diff -- It would be good to make the return type consistent.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18430657 --- Diff: sql/hive/pom.xml --- @@ -119,6 +83,74 @@ <profiles> <profile> + <id>hive-default</id> + <activation> +<property> + <name>!hive.version</name> --- End diff -- If we use the modified Hive dependencies, can we avoid this error and simplify the pom changes?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18431391 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive + +import java.util.Properties +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.{Table, Hive, Partition} +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import org.apache.hadoop.hive.serde2.{ColumnProjectionUtils, Deserializer} +import org.apache.hadoop.mapred.InputFormat +import org.apache.spark.Logging +import org.apache.hadoop.{io = hadoopIo} +import scala.collection.JavaConversions._ +import scala.language.implicitConversions + +/** + * A compatibility layer for interacting with Hive version 0.13.1. 
+ */ +private[hive] object HiveShim { + val version = "0.13.1" + /* + * TODO: hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) --- End diff -- Yeah, I think you are right: it is unbounded in Hive 0.12. Spark SQL will also use unbounded-precision decimals internally, so when the precision is not specified, that's what we should assume.
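To make the decimal question in this thread concrete, here is a minimal sketch of how a metastore type string could be told apart as unbounded `decimal` versus `decimal(precision, scale)`. It is not code from the PR: `DecimalTypeSketch`, `ParsedDecimal`, `Unbounded`, `Fixed`, and `parse` are hypothetical names, and the accepted string forms are an assumption about what a metastore might return.

```scala
// Hypothetical helper, not part of the PR: classifies a metastore type string.
object DecimalTypeSketch {
  // Matches "decimal(p,s)", tolerating spaces around the two numbers.
  private val Bounded = """decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)""".r

  sealed trait ParsedDecimal
  // No precision given: treated as unbounded, as suggested in the thread.
  case object Unbounded extends ParsedDecimal
  // User-defined precision and scale, as Hive 0.13 allows.
  final case class Fixed(precision: Int, scale: Int) extends ParsedDecimal

  // Returns None when the string is not a decimal type at all.
  def parse(metastoreType: String): Option[ParsedDecimal] =
    metastoreType.trim.toLowerCase match {
      case "decimal"     => Some(Unbounded)
      case Bounded(p, s) => Some(Fixed(p.toInt, s.toInt))
      case _             => None
    }
}
```

A parser that only recognizes one fixed string would return `None` for a column declared with a user-defined precision and scale, which is the parsing failure the review question is worried about.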
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57574745 @pwendell I think the packaging has a problem, probably in protobuf. I ran some test suites, but they do not pass. With the original package, the tests are OK. The following are some example failures. sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test-only org.apache.spark.sql.hive.CachedTableSuite Caused by: sbt.ForkMain$ForkError: com.google.protobuf_spark.GeneratedMessage at java.net.URLClassLoader$1.run(URLClassLoader.java:366) sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test-only org.apache.spark.sql.hive.execution.HiveQuerySuite [info] ... [info] Cause: java.lang.ClassNotFoundException: com.google.protobuf_spark.GeneratedMessage [info] at java.net.URLClassLoader$1.run(URLClassLoader.java:366) sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test-only org.apache.spark.sql.parquet.ParquetMetastoreSuite [info] ... [info] Cause: java.lang.ClassNotFoundException: com.google.protobuf_spark.GeneratedMessage [info] at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57580447 @yhuai I removed all unnecessary implicits to make it consistent, but I have to keep wrapperToFileSinkDesc because HiveFileFormatUtils.getHiveRecordWriter needs the FileSinkDesc type, and it also helps to track the internal state changes of FileSinkDesc.
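The reason for keeping one implicit conversion can be sketched with stand-in types. This is an illustration only, under the assumption that the wrapper carries plain serializable fields and is converted at the point where the library type is required; `SinkDesc`, `ShimSinkDesc`, `ShimConversions`, `Writer`, and `ImplicitSketch` are hypothetical names, not the PR's or Hive's classes.

```scala
import scala.language.implicitConversions

// Stand-in for a library class that cannot be serialized (hypothetical).
class SinkDesc(val dir: String, var compressed: Boolean)

// Serializable wrapper holding only plain fields (hypothetical).
// Case classes are Serializable by default.
case class ShimSinkDesc(dir: String, compressed: Boolean)

object ShimConversions {
  // Builds a fresh library object each time the wrapper is used
  // where a SinkDesc is expected.
  implicit def wrapperToSinkDesc(w: ShimSinkDesc): SinkDesc =
    new SinkDesc(w.dir, w.compressed)
}

// Stand-in for an API that only accepts the library type.
object Writer {
  def describe(desc: SinkDesc): String =
    s"${desc.dir} compressed=${desc.compressed}"
}

object ImplicitSketch {
  import ShimConversions._
  // The wrapper is passed directly; the compiler inserts the conversion.
  def run(): String = Writer.describe(ShimSinkDesc("/tmp/out", compressed = true))
}
```

Because the conversion creates a new object on every use, any state the callee sets on the converted value is not written back to the wrapper; callers that care about that state need to keep it on the wrapper side.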
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57273157 Hey @zhzhan I've published a modified version of Hive 0.13 that we can link against. A few benefits: 1. I fixed the hive-exec jar so it only contains hive packages and not a bunch of other code. 2. I'm using a shaded version of the protobuf dependency (otherwise, this interferes with the protobuf found in older hadoop client versions). https://oss.sonatype.org/content/repositories/orgspark-project-1077/org/spark-project/hive/ For now you'll need to add this repo to the build to use it, but if it all works I can just fully publish this to maven central.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57394068 @pwendell Thanks a lot. But I need some confirmation from you. 1. Is it a long-term decision to have a customized Hive 0.13 repo, org.spark-project.hive? 2. Do you want to support both Hive 0.13.0 and 0.13.1, or just 0.13.0? My current patch is against Hive 0.13.1. The package you provided is 0.13.0. There are also API compatibility issues between 0.13.1 and 0.13.0. I can provide a 0.13.0 patch, but this approach will also need a separate package for each of 0.13.0 and 0.13.1 if we want to support both.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57420122 Hey @zhzhan thanks for pointing that out, I didn't realize. I think it's fine to just support Hive 0.13.1; I didn't realize there were major changes. I went ahead and published a modified version of 0.13.1 here: https://oss.sonatype.org/content/repositories/orgspark-project-1079/org/spark-project/hive/
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18179030 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala --- @@ -284,24 +287,20 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) { val cmd_trimmed: String = cmd.trim() val tokens: Array[String] = cmd_trimmed.split("\\s+") val cmd_1: String = cmd_trimmed.substring(tokens(0).length()).trim() - val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) - - SessionState.start(sessionState) + val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hiveconf) proc match { case driver: Driver => - driver.init() - val results = new JArrayList[String] val response: CommandProcessorResponse = driver.run(cmd) // Throw an exception if there is an error in query processing. if (response.getResponseCode != 0) { -driver.destroy() +driver.close() throw new QueryExecutionException(response.getErrorMessage) } driver.setMaxRows(maxRows) driver.getResults(results) - driver.destroy() + driver.close() --- End diff -- A quick question since I ran into something similar when trying to run things on 0.13: can the driver be re-used after you call `close()`? Because I don't see the driver being removed from the CommandProcessorFactory cache, so another call to `runHive` will reuse the closed driver. (I ran into this when running the `HiveFromSpark` example compiled against Hive 0.13, so it should be easy to check.)
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18186480 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala --- @@ -284,24 +287,20 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) { val cmd_trimmed: String = cmd.trim() val tokens: Array[String] = cmd_trimmed.split("\\s+") val cmd_1: String = cmd_trimmed.substring(tokens(0).length()).trim() - val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) - - SessionState.start(sessionState) + val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hiveconf) proc match { case driver: Driver => - driver.init() - val results = new JArrayList[String] val response: CommandProcessorResponse = driver.run(cmd) // Throw an exception if there is an error in query processing. if (response.getResponseCode != 0) { -driver.destroy() +driver.close() throw new QueryExecutionException(response.getErrorMessage) } driver.setMaxRows(maxRows) driver.getResults(results) - driver.destroy() + driver.close() --- End diff -- My understanding is that the driver is not removed from mapDriver of CommandProcessorFactory, and will be reused. CommandProcessorFactory.clean(hiveConf) will destroy and remove the driver from mapDriver.
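The caching behaviour described in this exchange can be modelled with a toy factory. This is a sketch of the general pattern only, under the assumption that `close()` releases per-query state while the cached instance stays in the map until an explicit clean; `ToyDriver` and `ToyFactory` are hypothetical and do not reproduce Hive's `Driver` or `CommandProcessorFactory`.

```scala
import scala.collection.mutable

// Toy stand-in for a command driver (hypothetical, not Hive's Driver).
class ToyDriver {
  var closedCount = 0
  // Always reports success with response code 0.
  def run(cmd: String): Int = 0
  // Releases per-query state only; the instance itself stays usable.
  def close(): Unit = closedCount += 1
}

// Toy stand-in for a factory that caches one driver per configuration key.
object ToyFactory {
  private val drivers = mutable.Map.empty[String, ToyDriver]

  // Returns the cached driver, creating it on first use.
  def get(confId: String): ToyDriver =
    drivers.getOrElseUpdate(confId, new ToyDriver)

  // Drops the cached driver so the next get() builds a new one.
  def clean(confId: String): Unit = {
    drivers.remove(confId)
    ()
  }
}
```

In this model a second `get` after `close()` hands back the same instance, which is the reuse vanzin asks about, and only `clean` forces a fresh driver.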
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57242012 @zhzhan I think those implicits are not necessary. Can you change those?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18195198 --- Diff: sql/hive/pom.xml --- @@ -119,6 +83,74 @@ <profiles> <profile> + <id>hive-default</id> + <activation> +<property> + <name>!hive.version</name> --- End diff -- What will happen if we have `-Dhive.version=0.12.0`?
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57264575 @yhuai Can you be more specific regarding this comment: "I think those implicits are not necessary. Can you change those?"
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r18198152 --- Diff: sql/hive/pom.xml --- @@ -119,6 +83,74 @@ <profiles> <profile> + <id>hive-default</id> + <activation> +<property> + <name>!hive.version</name> --- End diff -- Due to the special packaging requirements of Hive 0.12.0, this will cause a compile error. For other, future versions, the flag is OK as long as a shim layer is provided.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57265563 I meant those implicit classes and methods.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57011211 @zhzhan Thank you for updating the PR. Can you let me know how to set up local dependencies? I'd like to try it. Thanks :)
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57040898 @yhuai The following command will build with 0.13.1 support. No extra setup is required. Replace every -Phive in the current build command with -Dhive.version=0.13.1. Note: don't use -Phive or -Pthrift-server together with -Dhive.version. sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-57041188 @yhuai I tried some 0.13.1 features, for example collection_list, native Parquet file support, and limited ORC file support.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17833372 --- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala --- @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.hive + +import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory +import scala.language.implicitConversions +import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc} +import org.apache.hadoop.hive.common.`type`.{HiveDecimal} +import scala.collection.JavaConversions._ +import org.apache.spark.Logging +import org.apache.hadoop.hive.serde2.ColumnProjectionUtils +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hive.ql.Context +import org.apache.hadoop.hive.ql.metadata.Partition +import org.apache.hadoop.{io = hadoopIo} +import org.apache.hadoop.hive.conf.HiveConf +import org.apache.hadoop.hive.common.StatsSetupConst +import org.apache.hadoop.mapred.InputFormat +import java.util.Properties +import org.apache.hadoop.hive.serde2.Deserializer + +/*hive-0.13.1 support shimmer layer*/ +object HiveShim { + val version = 0.13.1 + /* + * hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0) + * Full support of new decimal feature need to be fixed in seperate PR. 
+ */ + val metastoreDecimal = decimal(10,0) + def getTableDesc(serdeClass: Class[_ : Deserializer], inputFormatClass: Class[_ : InputFormat[_, _]], outputFormatClass: Class[_], properties: Properties) = { +new TableDesc(inputFormatClass, outputFormatClass, properties) + } + def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE + def createDefaultDBIfNeeded(context: HiveContext) ={ +context.runSqlHive(CREATE DATABASE default) +context.runSqlHive(USE default) + } + /*handle the difference in HiveQuerySuite*/ + def getEmptyCommentsFieldValue = + + def convertCatalystString2Hive(s: String) = s + + def getCommandProcessor(cmd: Array[String], conf: HiveConf) = { +CommandProcessorFactory.get(cmd, conf) + } + + def createDecimal(bd: java.math.BigDecimal): HiveDecimal = { +HiveDecimal.create(bd) + } + + /* + * This function in hive-0.13 become private, but we have to do this to walkaround hive bug + * */ + private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) { +val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, ) +val result: StringBuilder = new StringBuilder(old) +var first: Boolean = old.isEmpty + +for (col - cols) { + if (first) { +first = false + } + else { +result.append(',') + } + result.append(col) +} +conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString) + } + + /* + * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty + * */ + def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) { +if (ids != null ids.size 0) { + ColumnProjectionUtils.appendReadColumns(conf, ids) +} else { + appendReadColumnNames(conf, names) +} + } + + /* + * Bug introdiced in hive-0.13. FileSinkDesc is serilizable, but its member path is not. + * Fix it through wrapper. 
+ * */ + implicit def wrapperToFileSinkDesc(w: ShimFileSinkDesc): FileSinkDesc = { +var f = new FileSinkDesc(new Path(w.dir), w.tableInfo, w.compressed) +f.setCompressed(w.compressed) +f.setCompressCodec(w.compressCodec) +f.setCompressType(w.compressType) +f.setTableInfo(w.tableInfo) +f + } + + implicit class wrapperToPartition(p: Partition) { +def getDataLocationPath: Path =
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17834283 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -39,6 +38,8 @@ import org.apache.spark.sql.columnar.InMemoryRelation import org.apache.spark.sql.hive.execution.HiveTableScan import org.apache.spark.util.Utils +import org.apache.spark.sql.hive.HiveShim +import org.apache.spark.sql.hive.HiveShim._ --- End diff -- Yes. It needs the implicit methods.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17865634 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -40,6 +39,9 @@ import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.expressions.Row import org.apache.spark.sql.execution.{SparkPlan, UnaryNode} import org.apache.spark.sql.hive.{HiveContext, MetastoreRelation, SparkHiveHadoopWriter} +import org.apache.spark.sql.hive.{ShimFileSinkDesc => FileSinkDesc, ShimContext => Context} +import org.apache.spark.sql.hive.HiveShim +import org.apache.spark.sql.hive.HiveShim._ --- End diff -- It is for the implicit conversions.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17832655

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -59,7 +60,7 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
     val table = client.getTable(databaseName, tblName)
     val partitions: Seq[Partition] =
       if (table.isPartitioned) {
-        client.getAllPartitionsForPruner(table).toSeq
+        client.getAllPartitionsOf(table).toSeq
--- End diff --

This is just a method rename in Hive, with no functional change. Please refer to HIVE-5483.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17832668

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -292,24 +294,20 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
       val cmd_trimmed: String = cmd.trim()
       val tokens: Array[String] = cmd_trimmed.split("\\s+")
       val cmd_1: String = cmd_trimmed.substring(tokens(0).length()).trim()
-      val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
-
-      SessionState.start(sessionState)
+      val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hiveconf)
       proc match {
         case driver: Driver =>
-          driver.init()
-
           val results = new JArrayList[String]
           val response: CommandProcessorResponse = driver.run(cmd)
           // Throw an exception if there is an error in query processing.
           if (response.getResponseCode != 0) {
-            driver.destroy()
+            driver.close
--- End diff --

I followed some examples in the Hive code base. If driver.destroy is used, there is some very weird behavior.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17832786

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala ---
@@ -77,7 +78,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) {
   // For some hive test case which contain ${system:test.tmp.dir}
   System.setProperty("test.tmp.dir", testTempDir.getCanonicalPath)
-
+  CommandProcessorFactory.clean(hiveconf);
--- End diff --

This is for cleanup. Otherwise, I observe a memory leak in Hive 0.13.1. The function is as follows:

    public static void clean(HiveConf conf) {
      Driver drv = mapDrivers.get(conf);
      if (drv != null) {
        drv.destroy();
      }
      mapDrivers.remove(conf);
    }
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17833036

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -140,7 +141,7 @@ class HadoopTableReader(
       filterOpt: Option[PathFilter]): RDD[Row] = {
     val hivePartitionRDDs = partitionToDeserializer.map { case (partition, partDeserializer) =>
       val partDesc = Utilities.getPartitionDesc(partition)
-      val partPath = partition.getPartitionPath
+      val partPath = partition.getDataLocationPath
--- End diff --

You are right. getPartitionPath is not in hive-0.13.1, but it is in trunk, and we have to deal with it.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17833077

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala ---
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory
+import scala.language.implicitConversions
+import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
+import org.apache.hadoop.hive.common.`type`.{HiveDecimal}
+import scala.collection.JavaConversions._
+import org.apache.spark.Logging
+import org.apache.hadoop.hive.serde2.ColumnProjectionUtils
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.ql.Context
+import org.apache.hadoop.hive.ql.metadata.Partition
+import org.apache.hadoop.{io => hadoopIo}
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.common.StatsSetupConst
+import org.apache.hadoop.mapred.InputFormat
+import java.util.Properties
+import org.apache.hadoop.hive.serde2.Deserializer
+
+/*hive-0.13.1 support shimmer layer*/
+object HiveShim {
+  val version = "0.13.1"
+  /*
+   * hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0)
+   * Full support of new decimal feature need to be fixed in seperate PR.
+   */
+  val metastoreDecimal = "decimal(10,0)"
+  def getTableDesc(serdeClass: Class[_ <: Deserializer], inputFormatClass: Class[_ <: InputFormat[_, _]], outputFormatClass: Class[_], properties: Properties) = {
+    new TableDesc(inputFormatClass, outputFormatClass, properties)
+  }
+  def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE
+  def createDefaultDBIfNeeded(context: HiveContext) ={
+    context.runSqlHive("CREATE DATABASE default")
+    context.runSqlHive("USE default")
+  }
+  /*handle the difference in HiveQuerySuite*/
+  def getEmptyCommentsFieldValue = ""
+
+  def convertCatalystString2Hive(s: String) = s
+
+  def getCommandProcessor(cmd: Array[String], conf: HiveConf) = {
+    CommandProcessorFactory.get(cmd, conf)
+  }
+
+  def createDecimal(bd: java.math.BigDecimal): HiveDecimal = {
+    HiveDecimal.create(bd)
+  }
+
+  /*
+   * This function in hive-0.13 become private, but we have to do this to walkaround hive bug
+   * */
+  private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
+    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
+    val result: StringBuilder = new StringBuilder(old)
+    var first: Boolean = old.isEmpty
+
+    for (col <- cols) {
+      if (first) {
+        first = false
+      }
+      else {
+        result.append(',')
+      }
+      result.append(col)
+    }
+    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+  }
+
+  /*
+   * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty
--- End diff --

I did not see appendReadColumns handle empty strings, and I hit an error in this case during testing.
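The separator handling in `appendReadColumnNames` is the part that deals with an empty starting value, so it may help to see it in isolation. Below is a minimal sketch that models the configuration as an immutable `Map` instead of a Hadoop `Configuration`; the object name `ReadColumnsSketch` and the key `demo.read.column.names` are assumptions made for illustration and are not the real Hive constant.

```scala
object ReadColumnsSketch {
  // Hypothetical key for illustration only; the PR reads and writes
  // ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR on a Hadoop Configuration.
  val NamesKey = "demo.read.column.names"

  // Appends cols to the comma-separated list stored under NamesKey.
  // A separator is only emitted between elements, so an empty existing
  // value never produces a leading comma.
  def appendReadColumnNames(
      conf: Map[String, String],
      cols: Seq[String]): Map[String, String] = {
    val old = conf.getOrElse(NamesKey, "")
    val all = if (old.isEmpty) cols else old +: cols
    conf.updated(NamesKey, all.mkString(","))
  }
}
```

This gives the same result as the `first` flag loop in the diff: an existing value of `a` plus columns `b` and `c` yields `a,b,c`, and an empty existing value plus `a` and `b` yields `a,b`.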
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17833159

--- Diff: sql/hive/pom.xml ---
@@ -119,6 +83,74 @@
   <profiles>
     <profile>
+      <id>hive-default</id>
+      <activation>
+        <property>
+          <name>!hive.version</name>
+        </property>
+      </activation>
+      <dependencies>
+        <dependency>
+          <groupId>com.twitter</groupId>
+          <artifactId>parquet-hive-bundle</artifactId>
+          <version>1.5.0</version>
+        </dependency>
+        <dependency>
+          <groupId>org.spark-project.hive</groupId>
--- End diff --

For hive-0.12, the Hive jar has to be shaded because of version conflicts in protobuf. It is OK for hive-0.13.1, but overall the problem has to be fixed in Hive. I was told that the packaging issue may be fixed in Hive 0.14. Currently the primary Hive version supported is Hive 0.12, which uses a customized jar. Once everything is sorted out and Spark moves to support more recent Hive versions, this part of the pom should be changed accordingly and stabilized. The current solution is not perfect, but it is hard to find a better way to solve this problem in one shot.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17833192

--- Diff: sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim.scala ---
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.hadoop.hive.ql.processors.CommandProcessorFactory
+import scala.language.implicitConversions
+import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
+import org.apache.hadoop.hive.common.`type`.{HiveDecimal}
+import scala.collection.JavaConversions._
+import org.apache.spark.Logging
+import org.apache.hadoop.hive.serde2.ColumnProjectionUtils
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hive.ql.Context
+import org.apache.hadoop.hive.ql.metadata.Partition
+import org.apache.hadoop.{io => hadoopIo}
+import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.common.StatsSetupConst
+import org.apache.hadoop.mapred.InputFormat
+import java.util.Properties
+import org.apache.hadoop.hive.serde2.Deserializer
+
+/*hive-0.13.1 support shimmer layer*/
+object HiveShim {
+  val version = "0.13.1"
+  /*
+   * hive-0.13 support DECIMAL(precision, scale), DECIMAL in hive-0.12 is actually DECIMAL(10,0)
+   * Full support of new decimal feature need to be fixed in seperate PR.
+   */
+  val metastoreDecimal = "decimal(10,0)"
+  def getTableDesc(serdeClass: Class[_ <: Deserializer], inputFormatClass: Class[_ <: InputFormat[_, _]], outputFormatClass: Class[_], properties: Properties) = {
+    new TableDesc(inputFormatClass, outputFormatClass, properties)
+  }
+  def getStatsSetupConstTotalSize = StatsSetupConst.TOTAL_SIZE
+  def createDefaultDBIfNeeded(context: HiveContext) ={
+    context.runSqlHive("CREATE DATABASE default")
+    context.runSqlHive("USE default")
+  }
+  /*handle the difference in HiveQuerySuite*/
+  def getEmptyCommentsFieldValue = ""
+
+  def convertCatalystString2Hive(s: String) = s
+
+  def getCommandProcessor(cmd: Array[String], conf: HiveConf) = {
+    CommandProcessorFactory.get(cmd, conf)
+  }
+
+  def createDecimal(bd: java.math.BigDecimal): HiveDecimal = {
+    HiveDecimal.create(bd)
+  }
+
+  /*
+   * This function in hive-0.13 become private, but we have to do this to walkaround hive bug
+   * */
+  private def appendReadColumnNames(conf: Configuration, cols: Seq[String]) {
+    val old: String = conf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "")
+    val result: StringBuilder = new StringBuilder(old)
+    var first: Boolean = old.isEmpty
+
+    for (col <- cols) {
+      if (first) {
+        first = false
+      }
+      else {
+        result.append(',')
+      }
+      result.append(col)
+    }
+    conf.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, result.toString)
+  }
+
+  /*
+   * Cannot use ColumnProjectionUtils.appendReadColumns directly, if ids is null or empty
+   * */
+  def appendReadColumns(conf: Configuration, ids: Seq[Integer], names: Seq[String]) {
+    if (ids != null && ids.size > 0) {
+      ColumnProjectionUtils.appendReadColumns(conf, ids)
+    } else {
+      appendReadColumnNames(conf, names)
--- End diff --

Good catch.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user gss2002 commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-56185334

We have been using this fix for a few weeks now against Hive 13. The only outstanding issue I see, and this could be something larger, is that the Spark Thrift service doesn't seem to support hive.server2.enable.doAs = true. It doesn't set the proxy user.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user zhzhan commented on the pull request: https://github.com/apache/spark/pull/2241#issuecomment-56212828

This patch does not include the Thrift changes, which will be handled in other JIRAs, because I don't want the scope to get too big.
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17805241

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -170,13 +171,14 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
         val tableParameters = relation.hiveQlTable.getParameters
         val oldTotalSize =
-          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
+          Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize)).
--- End diff --

Don't orphan the `.`. I'd format this as:

```scala
Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))
  .map(_.toLong)
  .getOrElse(0L)
```
[GitHub] spark pull request: [SPARK-2706][SQL] Enable Spark to support Hive...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/2241#discussion_r17805302

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -292,24 +294,20 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
       val cmd_trimmed: String = cmd.trim()
       val tokens: Array[String] = cmd_trimmed.split("\\s+")
       val cmd_1: String = cmd_trimmed.substring(tokens(0).length()).trim()
-      val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf)
-
-      SessionState.start(sessionState)
+      val proc: CommandProcessor = HiveShim.getCommandProcessor(Array(tokens(0)), hiveconf)
       proc match {
         case driver: Driver =>
-          driver.init()
-
           val results = new JArrayList[String]
           val response: CommandProcessorResponse = driver.run(cmd)
           // Throw an exception if there is an error in query processing.
           if (response.getResponseCode != 0) {
-            driver.destroy()
+            driver.close
--- End diff --

Use `()` on all methods that have side-effects.
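The convention referred to here is the common Scala style rule that parameterless methods are written without parentheses only when they are pure accessors, and with `()` when they change state. A minimal sketch follows; `DemoDriver` and `ParensDemo` are hypothetical names and do not model the real Hive `Driver` class.

```scala
// Hypothetical resource with one pure accessor and one side-effecting method.
final class DemoDriver {
  private var open = true

  // Pure accessor: declared and called without parentheses.
  def isOpen: Boolean = open

  // Mutates state: declared and called with parentheses.
  def close(): Unit = { open = false }
}

object ParensDemo {
  def run(): Boolean = {
    val driver = new DemoDriver
    driver.close() // the () signals to the reader that something happens here
    driver.isOpen
  }
}
```

Applied to the diff above, the suggestion amounts to writing `driver.close()` instead of `driver.close`.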