Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread Jungtaek Lim
Sounds like a plan. +1 (non-binding) Thanks for volunteering!

On Sun, Apr 7, 2024 at 5:45 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
> Dongjoon.
>


Fwd: Apache Spark 3.4.3 (?)

2024-04-07 Thread Mich Talebzadeh
Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


view my LinkedIn profile

https://en.everybodywiki.com/Mich_Talebzadeh


*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


-- Forwarded message -
From: Mich Talebzadeh 
Date: Sun, 7 Apr 2024 at 11:56
Subject: Re: Apache Spark 3.4.3 (?)
To: Dongjoon Hyun 


Yes, given that a good number of people are using some flavour of 3.4.n,
this will be a good fit.

+1 for me


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


view my LinkedIn profile

https://en.everybodywiki.com/Mich_Talebzadeh


*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Sat, 6 Apr 2024 at 23:02, Dongjoon Hyun  wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread L. C. Hsieh
+1

Thanks Dongjoon!

On Sun, Apr 7, 2024 at 1:56 AM Kent Yao  wrote:
>
> +1, thank you, Dongjoon
>
>
> Kent
>
> Holden Karau wrote on Sun, Apr 7, 2024 at 14:54:
> >
> > Sounds good to me :)
> >
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.): 
> > https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
> > On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun  
> > wrote:
> >>
> >> Hi, All.
> >>
> >> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85 
> >> commits including important security and correctness patches like 
> >> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
> >>
> >> https://github.com/apache/spark/releases/tag/v3.4.2
> >>
> >> $ git log --oneline v3.4.2..HEAD | wc -l
> >>   85
> >>
> >> SPARK-45580 Subquery changes the output schema of the outer query
> >> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect 
> >> results
> >> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp 
> >> ntz
> >> SPARK-46794 Incorrect results due to inferred predicate from checkpoint 
> >> with subquery
> >> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> >> SPARK-45445 Upgrade snappy to 1.1.10.5
> >> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> >> SPARK-46239 Hide `Jetty` info
> >>
> >>
> >> Currently, I'm checking more applicable patches for branch-3.4. I'd like 
> >> to propose to release Apache Spark 3.4.3 and volunteer as the release 
> >> manager for Apache Spark 3.4.3. If there are no additional blockers, the 
> >> first tentative RC1 vote date is April 15th (Monday).
> >>
> >> WDYT?
> >>
> >>
> >> Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look.

Cheers

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


view my LinkedIn profile

https://en.everybodywiki.com/Mich_Talebzadeh


*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Sun, 7 Apr 2024 at 15:08, Cheng Pan  wrote:

> Instead of the External Shuffle Service, Apache Celeborn might be a good option
> as a Remote Shuffle Service for Spark on K8s.
>
> There are some useful resources you might be interested in.
>
> [1] https://celeborn.apache.org/
> [2] https://www.youtube.com/watch?v=s5xOtG6Venw
> [3] https://github.com/aws-samples/emr-remote-shuffle-service
> [4] https://github.com/apache/celeborn/issues/2140
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 6, 2024, at 21:41, Mich Talebzadeh 
> wrote:
> >
> > I have seen some older references for shuffle service for k8s,
> > although it is not clear they are talking about a generic shuffle
> > service for k8s.
> >
> > Anyhow, with the advent of GenAI and the need to allow for a larger
> > volume of data, I was wondering if there has been any more work on
> > this matter. Specifically, larger and scalable file systems like HDFS,
> > GCS, S3, etc. offer significantly larger storage capacity than local
> > disks on individual worker nodes in a k8s cluster, thus allowing
> > handling of much larger datasets more efficiently. Also, the degree of
> > parallelism and fault tolerance with these file systems comes into
> > it. I will be interested in hearing more about any progress on this.
> >
> > Thanks
> > .
> >
> > Mich Talebzadeh,
> >
> > Technologist | Solutions Architect | Data Engineer  | Generative AI
> >
> > London
> > United Kingdom
> >
> >
> >   view my Linkedin profile
> >
> >
> > https://en.everybodywiki.com/Mich_Talebzadeh
> >
> >
> >
> > Disclaimer: The information provided is correct to the best of my
> > knowledge but of course cannot be guaranteed . It is essential to note
> > that, as with any advice, quote "one test result is worth one-thousand
> > expert opinions (Werner Von Braun)".
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>


Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a
Remote Shuffle Service for Spark on K8s.

There are some useful resources you might be interested in.

[1] https://celeborn.apache.org/
[2] https://www.youtube.com/watch?v=s5xOtG6Venw
[3] https://github.com/aws-samples/emr-remote-shuffle-service
[4] https://github.com/apache/celeborn/issues/2140
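
For a rough idea of the Spark-side wiring, here is a minimal sketch in the
style of the config dicts used elsewhere in this thread, assuming a recent
Celeborn client; the shuffle-manager class name, the
spark.celeborn.master.endpoints key, and the master address are assumptions
to be verified against the Celeborn docs linked above.

spark_config_celeborn = {
    # Route shuffle reads/writes through the Celeborn client instead of local disks
    # (class name assumed from recent Celeborn releases; older ones used RssShuffleManager)
    "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
    # Where the Celeborn master(s) can be reached from driver and executors (hypothetical address)
    "spark.celeborn.master.endpoints": "celeborn-master-0.celeborn.svc:9097",
    # Spark's own external shuffle service is not needed when Celeborn stores shuffle data
    "spark.shuffle.service.enabled": "false",
}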

Thanks,
Cheng Pan


> On Apr 6, 2024, at 21:41, Mich Talebzadeh  wrote:
> 
> I have seen some older references for shuffle service for k8s,
> although it is not clear they are talking about a generic shuffle
> service for k8s.
> 
> Anyhow, with the advent of GenAI and the need to allow for a larger
> volume of data, I was wondering if there has been any more work on
> this matter. Specifically, larger and scalable file systems like HDFS,
> GCS, S3, etc. offer significantly larger storage capacity than local
> disks on individual worker nodes in a k8s cluster, thus allowing
> handling of much larger datasets more efficiently. Also, the degree of
> parallelism and fault tolerance with these file systems comes into
> it. I will be interested in hearing more about any progress on this.
> 
> Thanks
> .
> 
> Mich Talebzadeh,
> 
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> 
> London
> United Kingdom
> 
> 
>   view my Linkedin profile
> 
> 
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid

The configurations below can be used with k8s deployments of Spark. Spark
applications running on k8s can utilize these configurations to seamlessly
access data stored in Google Cloud Storage (GCS) and Amazon S3.

For Google GCS we may have

spark_config_gcs = {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
}

For Amazon S3 similar

spark_config_s3 = {
    "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.access.key": "s3_access_key",
    "spark.hadoop.fs.s3a.secret.key": "secret_key",
}
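
As a rough usage sketch (not a definitive recipe), such a dict can be applied
while building the SparkSession inside the application itself; the app name
and bucket path below are illustrative assumptions, and the GCS connector jar
is assumed to already be on the image's classpath.

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("gcs-on-k8s-example")
for key, value in spark_config_gcs.items():  # the GCS dict defined above
    builder = builder.config(key, value)
spark = builder.getOrCreate()

# With the connector settings in place, GCS paths can be read directly
df = spark.read.parquet("gs://some-bucket/path/to/data")  # hypothetical bucket/path
df.show()

The same loop works with spark_config_s3 when the target is S3 instead of GCS.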


To implement these configurations and enable Spark applications to interact
with GCS and S3, I guess we can approach it this way:

1) Spark Repository Integration: These configurations need to be added to
the Spark repository as part of the supported configuration options for k8s
deployments.

2) Configuration Settings: Users need to specify these configurations when
submitting Spark applications to a Kubernetes cluster. They can include
these configurations in the Spark application code or pass them as
command-line arguments or environment variables during application
submission.

HTH

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


view my LinkedIn profile

https://en.everybodywiki.com/Mich_Talebzadeh


*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Werner von Braun).


On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov 
wrote:

> There is an IBM shuffle service plugin that supports S3
> https://github.com/IBM/spark-s3-shuffle
>
> Though I would think a feature like this could be a part of the main Spark
> repo. Trino already has out-of-the-box support for S3 exchange (shuffle) and
> it's very useful.
>
> Vakaris
>
> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh 
> wrote:
>
>>
>> Thanks for your suggestion, which I take as a workaround. Whilst this
>> workaround can potentially address storage allocation issues, I was more
>> interested in exploring solutions that offer a more seamless integration
>> with large distributed file systems like HDFS, GCS, or S3. This would
>> ensure better performance and scalability for handling larger datasets
>> efficiently.
>>
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>> view my LinkedIn profile
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions" (Werner von Braun).
>>
>>
>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
>> wrote:
>>
>>> You can make a PVC on K8s and call it, say, 300gb.
>>>
>>> Make a folder in your Dockerfile:
>>> WORKDIR /opt/spark/work-dir
>>> RUN chmod g+w /opt/spark/work-dir
>>>
>>> Then start Spark adding this:
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
>>> "300gb") \
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
>>> "/opt/spark/work-dir") \
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>>> "False") \
>>>
>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
>>> "300gb") \
>>>
>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
>>> "/opt/spark/work-dir") \
>>>
>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>>> "False") \
>>>   .config("spark.local.dir", "/opt/spark/work-dir")
>>>
>>>
>>>
>>>
>>> lør. 6. apr. 2024 kl. 15:45 skrev Mich Talebzadeh <
>>> mich.talebza...@gmail.com>:
>>>
 I have seen some older references for shuffle service for k8s,
 although it is not clear they are talking about a generic shuffle
 service for k8s.

 Anyhow, with the advent of GenAI and the need to allow for a larger
 volume of data, I was wondering if there has been any more work on
 this 

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3
https://github.com/IBM/spark-s3-shuffle

Though I would think a feature like this could be a part of the main Spark
repo. Trino already has out-of-the-box support for S3 exchange (shuffle) and
it's very useful.
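
As a rough sketch of how a shuffle plugin like that is typically switched on,
Spark 3.x exposes a pluggable shuffle I/O hook; the plugin class name and the
spark.shuffle.s3.rootDir key below are assumptions about this particular
plugin and should be checked against the linked repository's README.

spark_config_s3_shuffle = {
    # Spark's pluggable shuffle I/O entry point (standard Spark 3.x config key)
    "spark.shuffle.sort.io.plugin.class": "org.apache.spark.shuffle.S3ShuffleDataIO",
    # Bucket and prefix where shuffle blocks would be written (hypothetical path)
    "spark.shuffle.s3.rootDir": "s3a://some-bucket/spark-shuffle",
}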

Vakaris

On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh 
wrote:

>
> Thanks for your suggestion, which I take as a workaround. Whilst this
> workaround can potentially address storage allocation issues, I was more
> interested in exploring solutions that offer a more seamless integration
> with large distributed file systems like HDFS, GCS, or S3. This would
> ensure better performance and scalability for handling larger datasets
> efficiently.
>
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
> view my LinkedIn profile
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions" (Werner von Braun).
>
>
> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
> wrote:
>
>> You can make a PVC on K8s and call it, say, 300gb.
>>
>> Make a folder in your Dockerfile:
>> WORKDIR /opt/spark/work-dir
>> RUN chmod g+w /opt/spark/work-dir
>>
>> Then start Spark adding this:
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
>> "300gb") \
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
>> "/opt/spark/work-dir") \
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>> "False") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
>> "300gb") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
>> "/opt/spark/work-dir") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>> "False") \
>>   .config("spark.local.dir", "/opt/spark/work-dir")
>>
>>
>>
>>
>> lør. 6. apr. 2024 kl. 15:45 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> I have seen some older references for shuffle service for k8s,
>>> although it is not clear they are talking about a generic shuffle
>>> service for k8s.
>>>
>>> Anyhow, with the advent of GenAI and the need to allow for a larger
>>> volume of data, I was wondering if there has been any more work on
>>> this matter. Specifically, larger and scalable file systems like HDFS,
>>> GCS, S3, etc. offer significantly larger storage capacity than local
>>> disks on individual worker nodes in a k8s cluster, thus allowing
>>> handling of much larger datasets more efficiently. Also, the degree of
>>> parallelism and fault tolerance with these file systems comes into
>>> it. I will be interested in hearing more about any progress on this.
>>>
>>> Thanks
>>> .
>>>
>>> Mich Talebzadeh,
>>>
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner Von Braun)".
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread Kent Yao
+1, thank you, Dongjoon


Kent

Holden Karau wrote on Sun, Apr 7, 2024 at 14:54:
>
> Sounds good to me :)
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun  wrote:
>>
>> Hi, All.
>>
>> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85 
>> commits including important security and correctness patches like 
>> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>>
>> https://github.com/apache/spark/releases/tag/v3.4.2
>>
>> $ git log --oneline v3.4.2..HEAD | wc -l
>>   85
>>
>> SPARK-45580 Subquery changes the output schema of the outer query
>> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect 
>> results
>> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp 
>> ntz
>> SPARK-46794 Incorrect results due to inferred predicate from checkpoint with 
>> subquery
>> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
>> SPARK-45445 Upgrade snappy to 1.1.10.5
>> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
>> SPARK-46239 Hide `Jetty` info
>>
>>
>> Currently, I'm checking more applicable patches for branch-3.4. I'd like to 
>> propose to release Apache Spark 3.4.3 and volunteer as the release manager 
>> for Apache Spark 3.4.3. If there are no additional blockers, the first 
>> tentative RC1 vote date is April 15th (Monday).
>>
>> WDYT?
>>
>>
>> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org