[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-929571635

Thanks @JoshRosen! This is some great analysis!

> I think we'll also run into similar problems in the Maven build. According to Maven's build lifecycle docs:

I completely missed this 🤦. Yes, adding the `hive-shaded` module to Spark is not a good idea given the above reasons about the SBT and Maven test lifecycles, and now I understand why other projects put the shaded library in a separate repo :) Let me spend more time revisiting the following two paths:

1. Shade all the dependencies in Hive (e.g., via the `hive-exec` fat jar) and make a new release, so Spark can start using that.
2. Create an ASF repo such as `spark-thirdparty`, following the examples of HBase & Hadoop. This needs community discussion as you mentioned, and I'm not sure how much more burden it will add to Spark's maintenance procedure.

> There's a tricky corner-case if a user has manually built a metastore classpath which includes only the dependencies not already provided by Spark

Thanks for the detailed explanation of how the `IsolatedClientLoader` works, and I agree this is a minor issue we should be aware of. We can either put something in the release notes, or perhaps exclude the unshaded Guava jar completely from the Spark distribution (for `hadoop-3.2`). Currently this appears to be blocked by the `curator-client` dependency as discussed earlier in the PR, but perhaps there is still a way to ship only shaded Guava (from `network-common`), with the few classes required by `curator-client` excluded from relocation.

> One more consideration: what about Hadoop 2.7 builds?

Another good question :) You are right that Hadoop 2.7 still uses unshaded Guava, while Hadoop 3.3.1 has switched to shaded Guava via HADOOP-17288. In addition, Spark uses the shaded Hadoop client from HADOOP-11804, which further relocates other Hadoop dependencies so they won't pollute Spark's classpath.
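For reference, the idea of shipping only shaded Guava while leaving a handful of classes unrelocated could look roughly like the maven-shade-plugin fragment below. This is a sketch: the shaded pattern matches what Spark's `network-common` module uses, but the excluded class is a hypothetical placeholder, not the actual list of classes `curator-client` needs.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.sparkproject.guava</shadedPattern>
        <excludes>
          <!-- Hypothetical example: keep the few classes curator-client
               references at their original coordinates. -->
          <exclude>com.google.common.base.Predicate</exclude>
        </excludes>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```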
I think one approach is to keep Guava 14.0.1 for the `hadoop-2.7` profile so everything still stays the same there. This at least will unblock us from upgrading Guava for the default `hadoop-3.2` profile, and make sure all the published Spark artifacts get the newer version of Guava. Also, the aforementioned idea of excluding unshaded Guava from the Spark distribution will only apply to the latter. A crazier idea is to shade Hadoop 2.7 as well if we go with the `spark-thirdparty` approach, but I'm not sure whether it's worth it given we are going to deprecate `hadoop-2.7` eventually.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-926976194

Hmm, interesting. After I changed the isolated class loader to pick Guava classes from the Hive jars (which are of version 14.0.1), tests started to fail because it now seems to use Spark's built-in Guava, which is 30.1.1-jre. This doesn't seem to make sense.

```
[error] sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
[error] 	at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
[error] 	at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
[error] 	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
[error] 	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
[error] 	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
[error] 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
[error] 	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
[error] 	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:831)
```

`Iterators.emptyIterator` here is no longer public in newer versions of Guava.
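The `IllegalAccessError` is a link-time failure: Hive 2.3.x was compiled against Guava 14, where `Iterators.emptyIterator()` was public, but at runtime it links against Guava 30, where that method is no longer public. As a hedged sketch, one way to spot such incompatibilities is to probe method visibility reflectively; the `probe` helper below is hypothetical, and a JDK class stands in for Guava so the snippet is self-contained.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Optional;

public class GuavaCompatProbe {
    // Hypothetical compatibility probe: does `cls` declare a method named
    // `name`, and if so, is it public? Against Guava 30.x, probing
    // com.google.common.collect.Iterators for "emptyIterator" would come
    // back non-public, which is what the IllegalAccessError above is about.
    static Optional<Boolean> probe(Class<?> cls, String name) {
        for (Method m : cls.getDeclaredMethods()) {
            if (m.getName().equals(name)) {
                return Optional.of(Modifier.isPublic(m.getModifiers()));
            }
        }
        return Optional.empty(); // method not declared at all
    }

    public static void main(String[] args) {
        // java.util.Collections.emptyIterator() is a public JDK method,
        // so this prints Optional[true].
        System.out.println(probe(java.util.Collections.class, "emptyIterator"));
    }
}
```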
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-926818975

> Do we know how the Guava version upgrade will interact with the IsolatedClientLoader used for metastore clients?

This crossed my mind too. The related unit tests (e.g., `VersionsSuite`) are all happy at the moment, but I'm afraid there could still be unknowns.

> Maybe we can remove this from isSharedClass; I'm not sure.

Yes, I'm also thinking of making the Guava classes non-shared here, so the class loader will instead pick the ones from the `hive-exec` jar (which is a fat jar with all the Guava classes). Ideally `hive-exec` should shade Guava by itself, but this is only done in the latest 2.3.x and master.
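For context, the shared-vs-isolated decision can be sketched as a simple prefix predicate: "shared" classes resolve from Spark's own classloader, everything else from the downloaded Hive jars. This is a heavily simplified illustration, not Spark's actual `isSharedClass` implementation, and the prefix list is made up for the example.

```java
import java.util.List;

public class SharedClassSketch {
    // Illustrative prefix list: classes under these packages come from
    // Spark's classloader; everything else from the isolated Hive jars.
    static final List<String> SHARED_PREFIXES =
        List.of("java.", "scala.", "org.slf4j.", "org.apache.log4j.");

    static boolean isSharedClass(String name) {
        // "com.google.common." is deliberately absent: with Guava
        // non-shared, the isolated loader resolves Guava 14 classes from
        // the hive-exec fat jar instead of Spark's built-in Guava 30.
        return SHARED_PREFIXES.stream().anyMatch(name::startsWith);
    }

    public static void main(String[] args) {
        System.out.println(isSharedClass("com.google.common.collect.Iterators")); // false
        System.out.println(isSharedClass("java.util.List"));                      // true
    }
}
```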
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-926076885

retest this please
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-924482103

> One question. By doing this, does the existing profile hive-provided still work?

@viirya good catch, I forgot the `hive-provided` profile. I think it should still apply to the shaded jar, but let me try it out. When the dependency is of provided scope, the shaded jar should not include its classes. When the profile is activated, I guess we'll expect users to provide their own Hive jars, and they'll be on their own for any potential Guava conflicts.
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-924479708

> Another question, the build/mvn command works well with the new module spark-hive-shaded, but my IDEA does not seem to recognize the classes from the spark-hive-shaded jar. Does IDEA work well for you?

Good catch @pan3793. I forgot to add the following to the hive shaded pom:
```
true
true
```
It seems IDEA doesn't have good support for resolving shaded classes, and I had to resort to the workaround [here](https://youtrack.jetbrains.com/issue/IDEA-126596), ignoring the `hive-shaded` project. But let me see if I can find a better way to do this.
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-919705312

Makes sense. Let me try to relocate those too.
sunchao commented on pull request #33989: URL: https://github.com/apache/spark/pull/33989#issuecomment-919682167

@pan3793 yes, that's correct. I'm not relocating all the dependencies because the others haven't caused any issues so far, AFAIK (after all, Spark today already leaks these dependencies). We can definitely include them if necessary; we just need to check whether they are used in APIs and whether relocating them causes any breakage.