Re: Dropping Apache Spark Hadoop2 Binary Distribution?
On Wed, 5 Oct 2022 at 21:59, Chao Sun wrote:

> +1
>
> > and specifically may allow us to finally move off of the ancient version of Guava (?)
>
> I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.

hadoop branch-2 has Guava dependencies; not sure which one.

A key lesson there is "never trust google artifacts to be stable at the binary level". Which is a shame, especially as there are some things in the jar (executors in particular) for which there is still no comparable Java equivalent.

Oh, we've also learned never to export *any* third-party class in a public API if possible. Which is also a shame, as the Java language lacks any form of tuple type and I do not want to reimplement all of that. Java 17 records would suffice, though as there's no java.lang.Tuple base type, there is no way to write methods which work on arbitrary tuples through some standard methods (elements(): int; element(int) -> Object).

It's that cliché interview question "implement a tree", updated for Guava: "how would you reimplement a popular Guava class so as to get independence from Guava releases and the ability to make it a return type in a public API?"

Anyway, good to see the change is in. The next step would be to have a baseline 3.x.y dependency as a minimum.

steve
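Purely as an illustration of that last idea (nothing below exists in the JDK, Guava, Spark, or Hadoop; the Pair, TupleDemo and render names are invented for the sketch), a Java 17 record plus a tiny Tuple interface along the lines of the elements()/element(int) protocol mentioned above might look like this:

```java
// Hypothetical sketch only: neither Tuple nor Pair exists in the JDK, Spark, or
// Hadoop. It just illustrates how Java 17 records could replace a Guava class in
// a public API without exposing any third-party type.
import java.util.Objects;

/** Minimal stand-in for the missing java.lang.Tuple base type. */
interface Tuple {
    int elements();              // arity of the tuple
    Object element(int index);   // positional access, 0-based
}

/** A two-element tuple as a plain record: no third-party types in the signature. */
record Pair<A, B>(A first, B second) implements Tuple {
    @Override public int elements() { return 2; }

    @Override public Object element(int index) {
        return switch (index) {
            case 0 -> first;
            case 1 -> second;
            default -> throw new IndexOutOfBoundsException(index);
        };
    }
}

class TupleDemo {
    /** Works on arbitrary tuples through the shared protocol, whatever their arity. */
    static String render(Tuple t) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < t.elements(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(Objects.toString(t.element(i)));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(render(new Pair<>("hadoop.version", 3)));  // prints (hadoop.version, 3)
    }
}
```

Records give equals/hashCode/toString for free, which covers much of what a library usually wants from such a type; the shared interface is only needed for the "methods over arbitrary tuples" case Steve raises.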
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
Thank you all. SPARK-40651 is merged to Apache Spark master branch for Apache Spark 3.4.0 now.

Dongjoon.

On Wed, Oct 5, 2022 at 3:24 PM L. C. Hsieh wrote:

> +1
>
> Thanks Dongjoon.
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
+1

Thanks Dongjoon.

On Wed, Oct 5, 2022 at 3:11 PM Jungtaek Lim wrote:

> +1
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
+1

On Thu, Oct 6, 2022 at 5:59 AM Chao Sun wrote:

> +1
>
> > and specifically may allow us to finally move off of the ancient version of Guava (?)
>
> I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
+1

> and specifically may allow us to finally move off of the ancient version of Guava (?)

I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.

On Wed, Oct 5, 2022 at 1:55 PM Xinrong Meng wrote:

> +1.
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
+1.

On Wed, Oct 5, 2022 at 1:53 PM Xiao Li wrote:

> +1.
>
> Xiao
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
+1.

Xiao

On Wed, Oct 5, 2022 at 12:49 PM Sean Owen wrote:

> I'm OK with this. It simplifies maintenance a bit, and specifically may allow us to finally move off of the ancient version of Guava (?)
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
I'm OK with this. It simplifies maintenance a bit, and specifically may allow us to finally move off of the ancient version of Guava (?)

On Mon, Oct 3, 2022 at 10:16 PM Dongjoon Hyun wrote:

> Hi, All.
>
> I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
> is still used by someone in the community or not. If it's not used or not useful,
> we may remove it from Apache Spark 3.4.0 release.
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
Thank you for your feedback and support, YangJie and Steve.

For the internally-built Hadoop clusters, I believe an internally-built Spark distribution with the corresponding custom Hadoop will be the best solution, rather than Apache Spark with the Apache Hadoop 2.7.4 client, in order to have the full internal changes.

I opened a PR to make this thread visible in Apache Spark 3.4.0.

SPARK-40651 Drop Hadoop2 binary distribution from release process
https://github.com/apache/spark/pull/38099

Dongjoon.

On 2022/10/04 19:32:52 Dongjoon Hyun wrote:

> Yes, it's yours. I added you (Steve Loughran) as BCC on the first email, Steve. :)
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
Yes, it's yours. I added you (Steve Loughran) as BCC on the first email, Steve. :)

Dongjoon.

On Tue, Oct 4, 2022 at 6:24 AM Steve Loughran wrote:

> that sounds suspiciously like something I'd write :)
>
> the move to java8 happened in HADOOP-11858; 3.0.0
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
that sounds suspiciously like something I'd write :)

the move to java8 happened in HADOOP-11858; 3.0.0

HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8", has been open since 2019 and I just closed it as WONTFIX.

Most of the big production hadoop 2 clusters use java7, because that is what they were deployed with. If you are upgrading java versions then you'd want to upgrade to a java8 version of guava (with fixes), a java8 version of jackson (with fixes), and at that point "upgrade the cluster" becomes the strategy.

If spark (or parquet, or avro, or ORC) claims to work on hadoop-2, then it's not enough to set hadoop.version in the build; it needs full integration testing with all those long-out-of-date transitive dependencies. And who does that? Nobody.

Does still claiming to support hadoop-2 cause any problems? Yes, because it forces anything which wants to use more recent APIs either to play reflection games (SparkHadoopUtil.createFile()...), have branch-3-only source trees (spark-hadoop-cloud), or stay stuck using older classes/methods for no real benefit. Finally: what are you going to do if someone actually files a bug related to spark 3.3.1 on hadoop 2.8.1? Is anyone really going to care?

Where this really frustrates me is in the libraries used downstream, which worry about java11/java17 compatibility etc. yet still set hadoop.version to 2.10, even though it blocks them from basic improvements, such as skipping a HEAD request whenever they open a file on abfs, s3a or gcs (AVRO-3594). Which transitively hurts iceberg, because it uses avro for its manifests, doesn't it?

As for the cutting edge stuff... anyone at ApacheCon reading this email on Oct 4 should attend the talk "Hadoop Vectored IO", where Mukund Thakur will be presenting the results of hive using the vectored IO version of ORC, and seeing a reduction of 10-20% in the overall runtime of TPCDH benchmarks (300G). That doesn't need hive changes, just a build of ORC using the new API for async/parallel fetch of stripes. The parquet support with spark benchmarks is still a WiP, but I would expect to see similar numbers, and again, no changes to spark, just parquet.

And as the JMH microbenchmarks against the raw local FS show a 20x speedup in reads (async fetch into direct buffers), anyone running spark on a laptop should see some speedups too.

Cloudera can ship this stuff internally. But the ASF projects are all stuck in time because of the belief that building against branch-2 makes sense. And it is transitive: Hive's requirements hold back iceberg, for example. (See also PARQUET-2173, ...)

If you want your applications to work better, especially in cloud, you should not just be running on a modern version of hadoop (and java11+, ideally); you *and your libraries* should be using the newer APIs to work with the data.

Finally, note that while that scatter/gather read call will only be in 3.3.5, we are doing a shim lib to offer the API to apps on older builds - it'll use readFully() to do the reads, just as the default implementation on all filesystems does on hadoop 3.3.5. See https://github.com/steveloughran/fs-api-shim ; it will become a hadoop extension lib. One which will not run on hadoop-2, but 3.2.x+ only. Obviously.

steve
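As a rough, hedged sketch of that readFully() fallback: this is not the fs-api-shim code, and RangeReadCompat, Range, hasVectoredRead and readRanges are invented names. The only Hadoop calls assumed are reflection over org.apache.hadoop.fs.PositionedReadable and FSDataInputStream.readFully(long, byte[], int, int), both of which predate Hadoop 3.

```java
// Illustrative sketch only -- NOT the fs-api-shim implementation. It shows the
// degraded path described above: when the Hadoop 3.3.5 readVectored() API is not
// on the classpath, fetch each requested range with a plain positioned readFully().
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.PositionedReadable;

final class RangeReadCompat {

    /** An (offset, length) pair to fetch; a stand-in for Hadoop 3.3.5's FileRange. */
    record Range(long offset, int length) {}

    /** True if this Hadoop build exposes a method named readVectored (3.3.5+, per the discussion above). */
    static boolean hasVectoredRead() {
        for (var m : PositionedReadable.class.getMethods()) {
            if (m.getName().equals("readVectored")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Fallback path: one blocking positioned readFully() per range, in order.
     * A real shim could coalesce adjacent ranges or issue these in parallel.
     */
    static List<ByteBuffer> readRanges(FSDataInputStream in, List<Range> ranges)
            throws IOException {
        List<ByteBuffer> out = new ArrayList<>(ranges.size());
        for (Range r : ranges) {
            byte[] buf = new byte[r.length()];
            // readFully(long, byte[], int, int) exists on both Hadoop 2.x and 3.x streams.
            in.readFully(r.offset(), buf, 0, r.length());
            out.add(ByteBuffer.wrap(buf));
        }
        return out;
    }
}
```

On a Hadoop 3.3.5+ runtime a real shim would presumably dispatch to the native readVectored() path instead, which is where the async/parallel fetch gains come from; the point of the sketch is only that the degraded path needs nothing newer than plain positioned reads.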
Re: Dropping Apache Spark Hadoop2 Binary Distribution?
Hi, Dongjoon

Our company (Baidu) is still using the combination of Spark 3.3 + Hadoop 2.7.4 in the production environment. Hadoop 2.7.4 is an internally maintained version compiled with Java 8.

Although we are using Hadoop 2, I still support this proposal because it is positive and exciting.

Regards,
YangJie

From: Dongjoon Hyun
Date: Tuesday, October 4, 2022, 11:16
To: dev
Subject: Dropping Apache Spark Hadoop2 Binary Distribution?

> Hi, All.
>
> I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
> is still used by someone in the community or not. If it's not used or not useful,
> we may remove it from Apache Spark 3.4.0 release.
Dropping Apache Spark Hadoop2 Binary Distribution?
Hi, All.

I'm wondering if the following Apache Spark Hadoop2 Binary Distribution is still used by someone in the community or not. If it's not used or not useful, we may remove it from the Apache Spark 3.4.0 release.

https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz

Here is the background of this question. Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache Spark community has been building and releasing with Java 8 only. I believe that user applications also use Java 8+ these days. Recently, I received the following message from the Hadoop PMC.

> "if you really want to claim hadoop 2.x compatibility, then you have to
> be building against java 7". Otherwise a lot of people with hadoop 2.x
> clusters won't be able to run your code. If your projects are java8+
> only, then they are implicitly hadoop 3.1+, no matter what you use
> in your build. Hence: no need for branch-2 branches except
> to complicate your build/test/release processes [1]

If the Hadoop2 binary distribution is no longer used as of today, or is incomplete somewhere due to Java 8 building, the following three existing alternative Hadoop 3 binary distributions could be the better official solution for old Hadoop 2 clusters.

1) Scala 2.12 and without-hadoop distribution
2) Scala 2.12 and Hadoop 3 distribution
3) Scala 2.13 and Hadoop 3 distribution

In short, is there anyone who is using the Apache Spark 3.3.0 Hadoop2 Binary distribution?

Dongjoon

[1] https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247