[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-28 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-929571635


   Thanks @JoshRosen! This is some great analysis!
   
   > I think we'll also run into similar problems in the Maven build. According to Maven's build lifecycle docs:
   
   I completely missed this 🤦. Yes, adding the `hive-shaded` module to Spark would not be a good idea given the above issues with the SBT and Maven test lifecycles, and now I understand why other projects put the shaded library in a separate repo :)
   
   Let me spend more time revisiting the following two paths:
   1. Shade all the dependencies in Hive (e.g., via the `hive-exec` fat jar) and make a new release, so Spark can start using that.
   2. Create an ASF repo such as `spark-thirdparty`, following the examples of HBase and Hadoop. This needs community discussion as you mentioned, and I'm not sure how much more burden it would add to Spark's maintenance procedure.
   
   > There's a tricky corner-case if a user has manually built a metastore classpath which includes only the dependencies not already provided by Spark
   
   Thanks for the detailed explanation of how the `IsolatedClientLoader` works; I agree this is a minor issue we should be aware of. We can either mention it in the release notes, or perhaps exclude the unshaded Guava jar completely from the Spark distribution (for `hadoop-3.2`). Currently this appears to be blocked by the `curator-client` dependency as discussed earlier in the PR, but perhaps there is still a way to ship only shaded Guava (from `network-common`), with the few classes required by `curator-client` excluded from relocation.
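   The relocation-with-exclusions idea could look roughly like this in `maven-shade-plugin` terms. This is a hypothetical sketch, not the actual Spark pom; the excluded class names below are placeholders, since the real list would come from whatever `curator-client` actually references:

   ```xml
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <configuration>
       <relocations>
         <relocation>
           <!-- Relocate Guava the way Spark's network-common does ... -->
           <pattern>com.google.common</pattern>
           <shadedPattern>org.sparkproject.guava</shadedPattern>
           <excludes>
             <!-- ... but leave the few classes curator-client needs alone.
                  These two entries are examples only. -->
             <exclude>com.google.common.base.Function</exclude>
             <exclude>com.google.common.base.Predicate</exclude>
           </excludes>
         </relocation>
       </relocations>
     </configuration>
   </plugin>
   ```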
   
   > One more consideration: what about Hadoop 2.7 builds?
   
   Another good question :) You are right that Hadoop 2.7 still uses unshaded Guava, while Hadoop 3.3.1 has switched to shaded Guava via HADOOP-17288. In addition, Spark uses the shaded Hadoop client from HADOOP-11804, which further relocates other Hadoop dependencies so they won't pollute Spark's classpath.
   
   I think one approach is to keep Guava 14.0.1 for the `hadoop-2.7` profile so everything stays the same there. This will at least unblock us from upgrading Guava for the default `hadoop-3.2` profile, and ensure that all the published Spark artifacts get the newer version of Guava. Also, the aforementioned idea of excluding unshaded Guava from the Spark distribution would only apply to the latter.
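   The per-profile pinning might look like this (a hypothetical sketch; the profile ids match Spark's, but wiring the Guava version through a profile property this way is just the proposal, not current Spark):

   ```xml
   <profiles>
     <profile>
       <id>hadoop-2.7</id>
       <properties>
         <!-- keep the old Guava so nothing changes for this profile -->
         <guava.version>14.0.1</guava.version>
       </properties>
     </profile>
     <profile>
       <id>hadoop-3.2</id>
       <properties>
         <!-- default profile gets the upgrade -->
         <guava.version>30.1.1-jre</guava.version>
       </properties>
     </profile>
   </profiles>
   ```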
   
   A crazier idea is to shade Hadoop 2.7 as well if we go with the `spark-thirdparty` approach, but I'm not sure whether it's worth it given that we will eventually deprecate `hadoop-2.7`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-24 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-926976194


   Hmm, interesting. After I changed the isolated classloader to pick Guava classes from the Hive jars (which are at 14.0.1), tests started to fail because it now seems to use Spark's built-in Guava, which is 30.1.1-jre. This doesn't seem to make sense.
   
   ```
   [error] sbt.ForkMain$ForkError: java.lang.IllegalAccessError: tried to access method com.google.common.collect.Iterators.emptyIterator()Lcom/google/common/collect/UnmodifiableIterator; from class org.apache.hadoop.hive.ql.exec.FetchOperator
   [error]  at org.apache.hadoop.hive.ql.exec.FetchOperator.<init>(FetchOperator.java:108)
   [error]  at org.apache.hadoop.hive.ql.exec.FetchTask.initialize(FetchTask.java:87)
   [error]  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:541)
   [error]  at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
   [error]  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
   [error]  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
   [error]  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
   [error]  at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:831)
   ```
   
   `Iterators.emptyIterator` is no longer public in newer versions of Guava.
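   For reference, this is a binary-compatibility break: bytecode compiled against Guava 14 links against a method that later Guava versions made non-public, so the JVM throws `IllegalAccessError` at link time. Callers that can be recompiled have a JDK-only drop-in replacement; a minimal sketch (the class name here is just for illustration):

   ```java
   import java.util.Collections;
   import java.util.Iterator;

   public class EmptyIteratorDemo {
       public static void main(String[] args) {
           // java.util.Collections.emptyIterator() (JDK 7+) avoids the
           // Guava call entirely, so it is immune to this Guava upgrade.
           Iterator<String> it = Collections.emptyIterator();
           System.out.println(it.hasNext()); // prints "false"
       }
   }
   ```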
   
   





[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-24 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-926818975


   > Do we know how the Guava version upgrade will interact with the IsolatedClientLoader used for metastore clients?
   
   This crossed my mind too. The related unit tests (e.g., `VersionsSuite`) are all happy at the moment, but I'm afraid there could still be unknowns.
   
   > Maybe we can remove this from isSharedClass; I'm not sure.
   
   Yes, I'm also thinking of making the Guava classes non-shared here, so the classloader will instead pick the ones from the `hive-exec` jar (which is a fat jar bundling all the Guava classes). Ideally `hive-exec` should shade Guava by itself, but this is only done in the latest 2.3.x and master.
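   To illustrate the idea, here is a hypothetical, heavily simplified Java sketch of a shared-class predicate; Spark's real `IsolatedClientLoader.isSharedClass` is Scala and checks considerably more than prefixes. Shared classes delegate to Spark's own loader; everything else loads child-first from the Hive client jars:

   ```java
   public class SharedClassCheck {
       static boolean isSharedClass(String name) {
           // Dropping the com.google.common entry would make the isolated
           // loader resolve Guava from the hive-exec fat jar (14.0.1)
           // instead of Spark's built-in copy (30.1.1-jre).
           return name.startsWith("java.")
               || name.startsWith("scala.")
               || name.startsWith("org.apache.spark.")
               || name.startsWith("com.google.common.");
       }

       public static void main(String[] args) {
           System.out.println(isSharedClass("com.google.common.base.Preconditions")); // true
           System.out.println(isSharedClass("org.apache.hadoop.hive.ql.Driver"));     // false
       }
   }
   ```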





[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-23 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-926076885


   retest this please





[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-21 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-924482103


   > One question. By doing this, does the existing profile hive-provided still work?
   
   @viirya good catch, I forgot about the `hive-provided` profile. I think it should still apply to the shaded jar, but let me try it out. When a dependency has `provided` scope, the shaded jar should not include its classes.
   
   When the profile is activated, I guess we'll expect users to provide their own Hive jars, and they'll be on their own for any potential Guava conflicts.
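   For context, `maven-shade-plugin` only bundles compile/runtime-scope artifacts by default, so a `provided`-scope Hive dependency should stay out of the shaded jar. A sketch of the relevant dependency declaration (not the actual Spark pom):

   ```xml
   <dependency>
     <groupId>org.apache.hive</groupId>
     <artifactId>hive-exec</artifactId>
     <version>${hive.version}</version>
     <!-- provided scope: supplied by the user at runtime,
          excluded from the shaded artifact by default -->
     <scope>provided</scope>
   </dependency>
   ```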





[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-21 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-924479708


   > Another question, the build/mvn command works well with the new module spark-hive-shaded, but my IDEA does not seem to recognize the classes from the spark-hive-shaded jar. Does IDEA work well for you?
   
   Good catch @pan3793. I forgot to add the following to the `hive-shaded` pom:
   ```
   true
   true
   ```
   It seems IDEA doesn't have good support for resolving shaded classes, and I had to fall back on the workaround described [here](https://youtrack.jetbrains.com/issue/IDEA-126596), ignoring the `hive-shaded` project. But let me see if I can find a better way to do this.





[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-14 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-919705312


   Makes sense. Let me try to relocate those too.





[GitHub] [spark] sunchao commented on pull request #33989: [SPARK-36676][SQL][BUILD] Create shaded Hive module and upgrade Guava version to 30.1.1-jre

2021-09-14 Thread GitBox


sunchao commented on pull request #33989:
URL: https://github.com/apache/spark/pull/33989#issuecomment-919682167


   @pan3793 yes, that's correct. I'm not relocating all the dependencies because the others haven't caused any issues so far AFAIK (after all, Spark today already leaks these dependencies). We can definitely include them if necessary; we just need to check whether they are used in APIs and whether relocating them causes any breakage.

