Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Yang Jie
Hmm... I guess this is meant to cc @Bingkun Pan?


On 2024/03/05 02:16:12 Hyukjin Kwon wrote:
> Is this related to https://github.com/apache/spark/pull/42428?
> 
> cc @Yang,Jie(INF) 
> 
> On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
> wrote:
> 
> > Shall we revisit this functionality? The API doc is built per individual
> > version, and each individual version's doc depends on other released
> > versions. That does not seem right to me. Also, the functionality exists
> > only in the PySpark API doc, which is inconsistent as well.
> >
> > I don't think this is manageable with the current approach (listing
> > versions in a version-dependent doc). Say we release 3.4.3 after 3.5.1:
> > should we update the 3.5.1 docs to add 3.4.3 to the version switcher? What
> > about the point where we have released ten more versions? What are the
> > criteria for pruning versions?
> >
> > Unless we have a good answer to these questions, I think it's better to
> > revert the functionality - it missed various considerations.
> >
> > On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> > wrote:
> >
> >> Thanks for reporting - this is odd - the dropdown did not exist in other
> >> recent releases.
> >>
> >> https://spark.apache.org/docs/3.5.0/api/python/index.html
> >> https://spark.apache.org/docs/3.4.2/api/python/index.html
> >> https://spark.apache.org/docs/3.3.4/api/python/index.html
> >>
> >> Looks like the dropdown feature was recently introduced but only
> >> partially finished. The dropdown itself was added, but the procedure for
> >> bumping the version was never documented.
> >> The contributor proposed a way to update the version "automatically", but
> >> that PR wasn't merged. As a result, we have neither instructions for
> >> bumping the version manually nor an automatic bump.
> >>
> >> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
> >> * PR for automatically bumping version:
> >> https://github.com/apache/spark/pull/42881
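For context, the dropdown being discussed appears to be the kind of version
switcher that pydata-sphinx-theme provides. A minimal sketch of the two pieces
involved, assuming that theme is in use; the json_url, component names, and
version entries below are illustrative, not Spark's actual configuration:

# docs/source/conf.py -- sketch of a pydata-sphinx-theme version switcher
release = "3.5.1"

html_theme = "pydata_sphinx_theme"
html_theme_options = {
    "switcher": {
        # The switcher fetches this JSON at page load; hosting it at a
        # version-independent URL is what lets already-published builds
        # see newly released versions.
        "json_url": "https://spark.apache.org/static/versions.json",  # hypothetical URL
        "version_match": release,
    },
    "navbar_end": ["version-switcher", "navbar-icon-links"],
}

# The JSON it fetches is just a list of entries such as:
# [
#   {"name": "3.5.1 (stable)", "version": "3.5.1",
#    "url": "https://spark.apache.org/docs/3.5.1/api/python/"},
#   {"name": "3.4.2", "version": "3.4.2",
#    "url": "https://spark.apache.org/docs/3.4.2/api/python/"}
# ]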
> >>
> >> We will probably need to add a step to the release process to update the
> >> version. (For automatic bumping I don't have a good idea yet.)
> >> I'll look into it. Please expect some delay during the holiday weekend in
> >> South Korea.
> >>
> >> Thanks again.
> >> Jungtaek Lim (HeartSaVioR)
> >>
> >>
> >> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
> >> wrote:
> >>
> >>> BTW, Jungtaek.
> >>>
> >>> The PySpark documentation seems to show the wrong branch. At the moment, `master`.
> >>>
> >>> https://spark.apache.org/docs/3.5.1/api/python/index.html
> >>>
> >>> PySpark Overview
> >>> 
> >>>
> >>>Date: Feb 24, 2024 Version: master
> >>>
> >>> [image: Screenshot 2024-02-29 at 21.12.24.png]
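A common way to avoid a hard-coded branch name here is to have the docs build
derive the displayed version from the package itself. This is only a sketch,
not PySpark's actual conf.py:

# docs/source/conf.py -- derive the displayed docs version from the
# package instead of a hard-coded branch name (illustrative only).
import pyspark

version = ".".join(pyspark.__version__.split(".")[:2])  # e.g. "3.5"
release = pyspark.__version__                           # e.g. "3.5.1"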
> >>>
> >>>
> >>> Could you do the follow-up, please?
> >>>
> >>> Thank you in advance.
> >>>
> >>> Dongjoon.
> >>>
> >>>
> >>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
> >>>
>  Excellent work, congratulations!
> 
>  On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>  wrote:
> 
> > Congratulations!
> >
> > Bests,
> > Dongjoon.
> >
> > On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
> >
> >> Congratulations!
> >>
> >>
> >>
> >> At 2024-02-28 17:43:25, "Jungtaek Lim" 
> >> wrote:
> >>
> >> Hi everyone,
> >>
> >> We are happy to announce the availability of Spark 3.5.1!
> >>
> >> Spark 3.5.1 is a maintenance release containing stability fixes. This
> >> release is based on the branch-3.5 maintenance branch of Spark. We
> >> strongly recommend that all 3.5 users upgrade to this stable release.
> >>
> >> To download Spark 3.5.1, head over to the download page:
> >> https://spark.apache.org/downloads.html
> >>
> >> To view the release notes:
> >> https://spark.apache.org/releases/spark-release-3-5-1.html
> >>
> >> We would like to acknowledge all community members for contributing
> >> to this
> >> release. This release would not have been possible without you.
> >>
> >> Jungtaek Lim
> >>
> >> PS: Yikun is helping us release the official Docker image for Spark 3.5.1
> >> (thanks, Yikun!). It may take some time to become generally available.
> >>
> >>
> 
>  --
>  John Zhuge
> 
> >>>
> 




Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion.

From: Jungtaek Lim 
Date: Tuesday, March 5, 2024, 10:46
To: Hyukjin Kwon 
Cc: yangjie01 , Dongjoon Hyun , 
dev , user 
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder whether, if we want to expose a
version switcher, it should live in the versionless docs (spark-website) rather
than in docs pinned to a specific version.
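One way to realize that, sketched under the assumption that the switcher
consumes a pydata-style JSON list: keep a single version manifest in
spark-website and regenerate it as part of the release steps, so no per-version
doc build ever embeds a version list. The script, path, and version list below
are hypothetical.

#!/usr/bin/env python3
"""Hypothetical helper: regenerate a versionless switcher manifest in
spark-website from the list of published doc versions."""
import json
from pathlib import Path

# Published doc versions, newest first. Updating this list is the only
# per-release step, and it lives outside any versioned doc build.
RELEASED = ["3.5.1", "3.5.0", "3.4.2", "3.3.4"]

def build_manifest(versions):
    entries = []
    for i, v in enumerate(versions):
        entries.append({
            "name": f"{v} (latest)" if i == 0 else v,
            "version": v,
            "url": f"https://spark.apache.org/docs/{v}/api/python/",
        })
    return entries

if __name__ == "__main__":
    out = Path("site/static/versions.json")  # hypothetical spark-website path
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(build_manifest(RELEASED), indent=2) + "\n")
    print(f"wrote {out}")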


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428?

cc @Yang,Jie(INF) 


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-04 Thread Prem Sahoo
Thanks, Jason, for the detailed information and the bug reference that goes
with it. Hopefully someone can provide more information about this pressing
issue.


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-04 Thread Jason Xu
Hi Prem,

From the symptoms of shuffle fetch failures with a few duplicate and a few
missing records, I think you might be running into this correctness bug:
https://issues.apache.org/jira/browse/SPARK-38388.

Node/shuffle failures are hard to avoid. I wonder whether you have
non-deterministic logic and are calling repartition() (round-robin
partitioning) in your code? If you can avoid either of these, you can avoid the
issue for now. Fixing the root cause requires a non-trivial effort; I don't
think a solution is available yet.
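To make the failure mode concrete, here is a minimal PySpark sketch of the
risky pattern and two ways around it. The input path and column names are
hypothetical, and this illustrates the pattern described above rather than
reproducing SPARK-38388 itself.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition-retry-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical path

events = spark.read.parquet("/data/events")  # hypothetical input

# Risky pattern: a non-deterministic row set feeding round-robin
# repartition(). If a shuffle fetch fails and the stage is retried, the
# recomputed rows can be assigned to different partitions than in the
# first attempt, which is how duplicates and missing rows can appear.
risky = (
    events
    .filter(F.rand() > 0.5)   # non-deterministic filter
    .repartition(200)         # round-robin partitioning
)

# Safer option 1: partition by a deterministic key so every attempt
# produces the same row-to-partition mapping.
safer = events.repartition(200, "event_id")

# Safer option 2: materialize the non-deterministic result before the
# shuffle so retries read a fixed snapshot instead of re-running the
# non-deterministic logic.
sampled = events.filter(F.rand() > 0.5).checkpoint()
stable = sampled.repartition(200)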

I have heard that there are community efforts to solve this issue, but I
lack detailed information. Hopefully, someone with more knowledge can
provide further insight.

Best,
Jason


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-04 Thread Prem Sahoo
super :(


Re: When Spark job shows FetchFailedException it creates few duplicate data and we see few data also missing , please explain why

2024-03-04 Thread Mich Talebzadeh
"... in a nutshell  if fetchFailedException occurs due to data node reboot
then it  can create duplicate / missing data  .   so this is more of
hardware(env issue ) rather than spark issue ."

As an overall conclusion your point is correct but again the answer is not
binary.

Spark core relies on a distributed file system to store data across data
nodes. When Spark needs to process data, it fetches the required blocks from
the data nodes. A *FetchFailedException* means that Spark encountered an error
while fetching data blocks from a data node. If a data node reboots
unexpectedly, it becomes unavailable to Spark for a period. During this time,
Spark might attempt to fetch data blocks from the unavailable node, resulting
in the FetchFailedException. Depending on the timing and nature of the reboot
and data access, this exception can lead to the following:

   - Duplicate Data: If Spark retries the fetch operation successfully
   after the reboot, it might end up processing the same data twice, leading
   to duplicates.
   - Missing Data: If Spark cannot fetch all required data blocks due to
   the unavailable data node, some data might be missing from the processing
   results.

The root cause of this issue lies in the data node reboot itself. So we can
conclude that it is not a problem with Spark core functionality but rather an
environmental issue within the distributed storage system. You need to ensure
that your nodes are stable and minimise unexpected reboots for whatever reason.
Look at the host logs or run /usr/bin/dmesg to see what happened.
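For completeness, a sketch of the settings typically tuned around this failure
mode. These are standard Spark configuration keys, but the values are
illustrative only and must be tuned per cluster; this is not a fix for the
underlying node instability.

from pyspark.sql import SparkSession

# Make shuffle fetches retry longer before a FetchFailedException is
# raised, bound stage/task retries explicitly, and keep speculative
# execution off so it cannot add duplicate task attempts of its own.
spark = (
    SparkSession.builder
    .appName("fetch-failure-tuning-sketch")
    .config("spark.shuffle.io.maxRetries", "10")        # fetch retries per block
    .config("spark.shuffle.io.retryWait", "15s")        # wait between fetch retries
    .config("spark.stage.maxConsecutiveAttempts", "8")  # stage retries after fetch failures
    .config("spark.task.maxFailures", "8")              # task attempts before the job fails
    .config("spark.speculation", "false")               # no speculative duplicate attempts
    .getOrCreate()
)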

Good luck

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note that, as
with any advice, quote "one test result is worth one-thousand expert opinions"
(Werner von Braun).


On Mon, 4 Mar 2024 at 01:30, Prem Sahoo  wrote:

> thanks Mich, in a nutshell, if fetchFailedException occurs due to a data
> node reboot then it can create duplicate / missing data, so this is more of
> a hardware (environment) issue rather than a Spark issue.
>
>
>
> On Sat, Mar 2, 2024 at 7:45 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> It seems to me that there are issues related to below
>>
> >> *I think when a task fails in between and a retry task starts and
> >> completes, it may create duplicates, since the failed task produced some
> >> data and the retry task produces the full data. But my question is why
> >> Spark keeps the delta data; or, as you say, if a speculative and an
> >> original task both complete, Spark generally kills one of them to get rid
> >> of duplicate data. And when a data node is rebooted, shouldn't Spark's
> >> fault tolerance go to other nodes? Then why is data missing?*
>>
>> Spark is designed to be fault-tolerant through lineage and recomputation.
>> However, there are scenarios where speculative execution or task retries
>> might lead to duplicated or missing data. So what are these?
>>
>> - Task Failure and Retry: You are correct that a failed task might have
>> processed some data before encountering the FetchFailedException. If a
>> retry succeeds, it would process the entire data partition again, leading
> >> to duplicates. When a task fails, Spark may recompute the lost data by
> >> re-running the lost task on another node. The output of the retried task
>> is typically combined with the output of the original task during the final
>> stage of the computation. This combination is done to handle scenarios
>> where the original task partially completed and generated some output
>> before failing. Spark does not intentionally store partially processed
>> data. However, due to retries and speculative execution, duplicate
>> processing can occur. To the best of my knowledge, Spark itself doesn't
>> have a mechanism to identify and eliminate duplicates automatically. While
>> Spark might sometimes kill speculative tasks if the original one finishes,
>> it is not a guaranteed behavior. This depends on various factors like
>> scheduling and task dependencies.
>>
>> - Speculative Execution: Spark supports speculative execution, where the
>> same task is launched on multiple executors simultaneously. The result of
>> the first completed task is used, and the others are usually killed to
>> avoid duplicated results. However, speculative execution might introduce
>> some duplication in the final output if tasks on different executors
>> complete successfully.
>>
>> - Node Reboots and Fault Tolerance: If the data node reboot leads to data
>> corruption or loss, that data might be unavailable to Spark. Even with
>> fault tolerance, Spark cannot recover completely missing data. Fault
>> tolerance