Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-06 Thread Steve Loughran
On Wed, 5 Oct 2022 at 21:59, Chao Sun  wrote:

> +1
>
> > and specifically may allow us to finally move off of the ancient version
> of Guava (?)
>
> I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.
>

hadoop branch-2 has guava dependencies; not sure which version.

A key lesson there is "never trust google artifacts to be stable at the
binary level"

Which is a shame, especially as there are some things in the jar (the
executor utilities in particular) for which there is still no comparable Java
equivalent.

Oh, we've also learned never to export *any* third-party class in a public
API if possible.
Which is also a shame, as the Java language lacks any form of tuple type and I
do not want to reimplement all of that. Java 17 records would suffice, though
as there's no java.lang.Tuple base type, there's no way to write methods which
work on arbitrary tuples through some standard methods (elements(): int;
element(int) -> Object).
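
To make the gap concrete, here is a minimal Java 17 sketch of what such a
base type could look like; the Tuple interface and Pair record are
hypothetical, not part of any JDK, guava or Spark API:

    // Hypothetical: there is no java.lang.Tuple, so generic code cannot walk
    // arbitrary tuples; this interface is what such a base type could offer.
    interface Tuple {
      int elements();               // arity
      Object element(int index);    // zero-based element accessor
    }

    // A Java 17 record carries the fields; implementing Tuple opts it in.
    record Pair<A, B>(A first, B second) implements Tuple {
      @Override public int elements() { return 2; }
      @Override public Object element(int i) {
        return switch (i) {
          case 0 -> first;
          case 1 -> second;
          default -> throw new IndexOutOfBoundsException(i);
        };
      }
    }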

It's that cliché interview question "implement a tree", updated for guava:
"how would you reimplement a popular guava class so as to get independence
from guava releases and the ability to make it a return type in a public
api"

Anyway, good to see the change is in. The next step would be to have a
baseline 3.x.y dependency as a minimum.

steve


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Dongjoon Hyun
Thank you all.

SPARK-40651 has been merged to the Apache Spark master branch for Apache
Spark 3.4.0 now.

Dongjoon.


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread L. C. Hsieh
+1

Thanks Dongjoon.


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Jungtaek Lim
+1


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Chao Sun
+1

> and specifically may allow us to finally move off of the ancient version
of Guava (?)

I think the Guava issue comes from Hive 2.3 dependency, not Hadoop.


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Xinrong Meng
+1.


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Xiao Li
+1.

Xiao


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-05 Thread Sean Owen
I'm OK with this. It simplifies maintenance a bit, and specifically may
allow us to finally move off of the ancient version of Guava (?)


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-04 Thread Dongjoon Hyun
Thank you for your feedback and support, YangJie and Steve.

For internally-built Hadoop clusters, I believe an internally-built Spark
distribution against the corresponding custom Hadoop will be the best solution,
rather than Apache Spark with the Apache Hadoop 2.7.4 client, since it picks up
the full set of internal changes.

I opened a PR to make this thread visible in Apache Spark 3.4.0.

SPARK-40651 Drop Hadoop2 binary distribution from release process
https://github.com/apache/spark/pull/38099

Dongjoon.


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-04 Thread Dongjoon Hyun
Yes, it's yours. I added you (Steve Loughran ) as BCC on the first email,
Steve. :)

Dongjoon.


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-04 Thread Steve Loughran
that sounds suspiciously like something I'd write :)

the move to java8 happened in HADOOP-11858; 3.0.0

HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8", has been
open since 2019 and I just closed it as WONTFIX.

Most of the big production hadoop 2 clusters use java7, because that is
what they were deployed with; if you are upgrading java versions then
you'd want to upgrade to a java8 version of guava (with fixes) and a java8
version of jackson (with fixes), and at that point "upgrade the cluster"
becomes the strategy.

If spark (or parquet, or avro, or ORC) claims to work on hadoop-2, then it's
not enough to set hadoop.version in the build; it needs full integration
testing with all those long-out-of-date transitive dependencies. And who
does that? Nobody.


Does still claiming to support hadoop-2 cause any problems? Yes, because it
forces anything which wants to use more recent APIs either to play
reflection games (SparkHadoopUtil.createFile()...), keep branch-3-only
source trees (spark-hadoop-cloud), or stay stuck using older
classes/methods for no real benefit. Finally: what are you going to do if
someone actually files a bug related to spark 3.3.1 on hadoop 2.8.1? Is
anyone really going to care?
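
As an illustration of those reflection games (this is the shape of the
pattern, not Spark's actual SparkHadoopUtil code), a library compiled against
hadoop-2 might probe for a hadoop-3 method like this:

    import java.lang.reflect.Method;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Compiled against hadoop-2, where FileSystem.openFile() (added in
    // hadoop 3.3) does not exist, so it can only be discovered at runtime.
    final class Hadoop3Probe {
      private static final Method OPEN_FILE = lookupOpenFile();

      private static Method lookupOpenFile() {
        try {
          return FileSystem.class.getMethod("openFile", Path.class);
        } catch (NoSuchMethodException e) {
          return null; // hadoop-2 at runtime: callers fall back to fs.open()
        }
      }

      static boolean hasOpenFile() {
        return OPEN_FILE != null;
      }
    }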

Where this really frustrates me is in the libraries used downstream, which
worry about java11 and java17 compatibility etc. yet still set hadoop.version
to 2.10, even though it blocks them from basic improvements, such as skipping
a HEAD request whenever they open a file on abfs, s3a or gcs (AVRO-3594).
Which transitively hurts iceberg, because it uses avro for its manifests,
doesn't it?
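
The kind of improvement being blocked looks roughly like this sketch; it
assumes hadoop 3.3.5+, where the standard fs.option.openfile.* keys exist
(openFile() itself dates from 3.3.0, with store-specific option names before
3.3.5):

    import java.io.IOException;
    import java.util.concurrent.ExecutionException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // When the caller already knows the file length (format readers usually
    // do), handing it to openFile() lets abfs/s3a/gcs skip the HEAD probe.
    final class OpenFileNoHead {
      static FSDataInputStream open(FileSystem fs, Path path, long knownLength)
          throws IOException, InterruptedException, ExecutionException {
        return fs.openFile(path)
            .opt("fs.option.openfile.length", Long.toString(knownLength))
            .opt("fs.option.openfile.read.policy", "sequential")
            .build()
            .get(); // build() returns a CompletableFuture<FSDataInputStream>
      }
    }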

As for the cutting edge stuff... anyone at ApacheCon reading this email on
Oct 4 should attend the talk "Hadoop Vectored IO", where Mukund Thakur will
be presenting the results of hive using the vectored IO version of ORC and
seeing a 10-20% reduction in the overall runtime of TPCDH benchmarks
(300G). That doesn't need hive changes, just a build of ORC using the new
API for async/parallel fetch of stripes. The parquet support with spark
benchmarks is still a WIP, but I would expect to see similar numbers, and
again, no changes to spark, just parquet.

And as the JMH microbenchmarks against the raw local FS show a 20x speedup
in reads (async fetch into direct buffers), anyone running spark on a
laptop should see some speedups too.

Cloudera can ship this stuff internally. But the ASF projects are all stuck
in time because of the belief that building against branch-2 makes sense.
And it is transitive: Hive's requirements hold back iceberg, for example
(see also PARQUET-2173, ...).

If you want your applications to work better, especially in cloud, you
should not just be running on a modern version of hadoop (and java11+,
ideally), you *and your libraries* should be using the newer APIs to work
with the data.

Finally, note that while that scatter/gather read call will only be in 3.3.5,
we are doing a shim lib to offer the API to apps on older builds; it'll use
readFully() to do the reads, just as the default implementation on all
filesystems does on hadoop 3.3.5. See
https://github.com/steveloughran/fs-api-shim ; it will become a hadoop
extension lib, one which will not run on hadoop-2, but on 3.2.x+ only,
obviously.
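
A minimal sketch of that fallback path, with a hypothetical Range holder
standing in for the real FileRange class:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.List;
    import org.apache.hadoop.fs.FSDataInputStream;

    // readFully()-based stand-in for readVectored(): correct on any hadoop
    // release, just without the async/parallel fetch of a 3.3.5-aware store.
    // Range is a hypothetical holder, not hadoop 3.3.5's real FileRange.
    final class VectoredReadFallback {
      static final class Range {
        final long offset;
        final int length;
        Range(long offset, int length) { this.offset = offset; this.length = length; }
      }

      static ByteBuffer[] readRanges(FSDataInputStream in, List<Range> ranges)
          throws IOException {
        ByteBuffer[] results = new ByteBuffer[ranges.size()];
        for (int i = 0; i < ranges.size(); i++) {
          Range r = ranges.get(i);
          byte[] buf = new byte[r.length];
          in.readFully(r.offset, buf, 0, r.length); // positioned read; main cursor unmoved
          results[i] = ByteBuffer.wrap(buf);
        }
        return results;
      }
    }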

steve


Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-03 Thread Yang,Jie(INF)
Hi, Dongjoon

Our company (Baidu) is still using the combination of Spark 3.3 + Hadoop 2.7.4
in the production environment. Hadoop 2.7.4 is an internally maintained version
compiled with Java 8. Although we are using Hadoop 2, I still support this
proposal because it is positive and exciting.

Regards,
YangJie


Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-03 Thread Dongjoon Hyun
Hi, All.

I'm wondering whether the following Apache Spark Hadoop2 Binary Distribution
is still used by anyone in the community. If it's not used or not useful,
we may remove it from the Apache Spark 3.4.0 release.


https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz

Here is the background of this question.
Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
Spark community has been building and releasing with Java 8 only.
I believe that user applications also use Java 8+ these days.
Recently, I received the following message from the Hadoop PMC.

  > "if you really want to claim hadoop 2.x compatibility, then you have to
  > be building against java 7". Otherwise a lot of people with hadoop 2.x
  > clusters won't be able to run your code. If your projects are java8+
  > only, then they are implicitly hadoop 3.1+, no matter what you use
  > in your build. Hence: no need for branch-2 branches except
  > to complicate your build/test/release processes [1]

If the Hadoop2 binary distribution is no longer used as of today,
or is incomplete somewhere due to being built with Java 8, the following three
existing alternative binary distributions could be
the better official solution for old Hadoop 2 clusters.

1) Scala 2.12 and without-hadoop distribution
2) Scala 2.12 and Hadoop 3 distribution
3) Scala 2.13 and Hadoop 3 distribution

In short, is there anyone who is using the Apache Spark 3.3.0 Hadoop2
Binary distribution?

Dongjoon

[1]
https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247