Re: When a Spark job shows FetchFailedException it creates some duplicate data and we also see some missing data, please explain why

2024-03-03 Thread Prem Sahoo
Thanks Mich. In a nutshell, if a FetchFailedException occurs due to a data
node reboot, it can create duplicate or missing data. So this is more of a
hardware/environment issue than a Spark issue.



On Sat, Mar 2, 2024 at 7:45 AM Mich Talebzadeh 
wrote:

> Hi,
>
> It seems to me that the issues relate to the points below:
>
> *I think that when a task fails partway through and a retry task starts
> and completes, it may create duplicates, as the failed task has some data
> and the retry task has the full data. But my question is why Spark keeps
> the partial data - or, according to you, if both the speculative and the
> original task complete, Spark generally kills one of them to get rid of
> duplicate data. When a data node is rebooted, Spark's fault tolerance
> should go to other nodes, shouldn't it? Then why is there missing data?*
>
> Spark is designed to be fault-tolerant through lineage and recomputation.
> However, there are scenarios where speculative execution or task retries
> might lead to duplicated or missing data. So what are these?
>
> - Task Failure and Retry: You are correct that a failed task might have
> processed some data before encountering the FetchFailedException. If a
> retry succeeds, it processes the entire data partition again, which can
> lead to duplicates. When a task fails, Spark may recover the lost data by
> re-running the failed task on another node. The output of the retried task
> is typically combined with the output of the original task during the
> final stage of the computation, to handle scenarios where the original
> task partially completed and generated some output before failing. Spark
> does not intentionally store partially processed data; however, due to
> retries and speculative execution, duplicate processing can occur. To the
> best of my knowledge, Spark itself does not have a mechanism to identify
> and eliminate such duplicates automatically (a rough sketch of a
> downstream de-duplication check follows these points). While Spark may
> kill speculative tasks once the original one finishes, this is not
> guaranteed behaviour; it depends on factors such as scheduling and task
> dependencies.
>
> - Speculative Execution: Spark supports speculative execution, where the
> same task is launched on multiple executors simultaneously. The result of
> the first task to complete is used, and the others are usually killed to
> avoid duplicated results. However, speculative execution can still
> introduce some duplication in the final output if tasks on different
> executors both complete successfully.
>
> - Node Reboots and Fault Tolerance: If the data node reboot leads to data
> corruption or loss, that data might be unavailable to Spark. Even with
> fault tolerance, Spark cannot recover completely missing data. Fault
> tolerance focuses on recovering from issues like executor failures, not
> data loss on storage nodes. Overall, Spark's fault tolerance is designed to
> handle executor failures by rescheduling tasks on other available executors
> and temporary network issues by retrying fetches based on configuration.
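>
> Since Spark will not de-duplicate for you, here is a rough sketch of a
> downstream check (Scala, runnable in spark-shell; the key column
> "order_id" and the paths are hypothetical placeholders, not part of your
> job):
>
>   // Read back what the job wrote; the path is a placeholder.
>   val df = spark.read.parquet("hdfs:///tmp/job_output")
>
>   // Compare total vs distinct-by-key counts to spot duplicates, then
>   // keep one row per assumed business key.
>   val total   = df.count()
>   val deduped = df.dropDuplicates("order_id")
>   println(s"total=$total, afterDedup=${deduped.count()}")
>
>   deduped.write.mode("overwrite").parquet("hdfs:///tmp/job_output_dedup")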
>
> Here are some things to consider (a short configuration sketch follows
> this list):
>
> - Minimize retries: Adjust spark.shuffle.io.maxRetries to a lower value
> such as  1 or 2 to reduce the chance of duplicate processing attempts, if
> retries are suspected to be a source.
> - Disable speculative execution if needed: Consider disabling speculative
> execution (spark.speculation=false) if duplicates are a major concern.
> However, this might impact performance.
> - Data persistence: As mentioned in the previous reply, persist
> intermediate data to reliable storage (HDFS, GCS, etc.) if data integrity
> is critical. This ensures data availability even during node failures.
> - Data validation checks: Implement data validation checks after
> processing to identify potential duplicates or missing data.
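>
> As a rough sketch (Scala; the app name, paths and values below are
> placeholders and assumptions, not tuned recommendations), the first three
> points could be applied like this:
>
>   import org.apache.spark.sql.SparkSession
>
>   val spark = SparkSession.builder()
>     .appName("fetch-failure-mitigation")          // placeholder name
>     .config("spark.shuffle.io.maxRetries", "2")   // fewer fetch retries
>     .config("spark.speculation", "false")         // disable speculation
>     .getOrCreate()
>
>   // Persist intermediate results to reliable storage before the next stage.
>   val intermediate = spark.read.parquet("hdfs:///data/input")   // placeholder
>   intermediate.write.mode("overwrite").parquet("hdfs:///data/stage1")
>
> Whether lowering retries or disabling speculation is worth the performance
> trade-off depends on your workload, so treat these as starting points to
> test rather than defaults to adopt.
>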
> HTH
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
>
> On Sat, 2 Mar 2024 at 01:43, Prem Sahoo  wrote:
>
>> Hello Mich,
>> thanks for your reply.
>>
>> As an engineer I can chip in. You may have partial execution and
>> retries, meaning that when Spark encounters a *FetchFailedException*, it
>> may retry fetching the data from the unavailable node (the one being
>> rebooted) a few times before marking it permanently unavailable. However,
>> if the rebooted node recovers quickly within this retry window, some
>> executors might successfully fetch the data after a retry. *This can lead
>> to duplicate processing of the same data partition*.
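>>
>> As an illustration (Scala; the values shown are the documented defaults,
>> used here only as an example), the length of that retry window comes from
>> the shuffle client settings, set before the SparkSession is created:
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   val spark = SparkSession.builder()
>>     .config("spark.shuffle.io.maxRetries", "3")  // fetch retries (default 3)
>>     .config("spark.shuffle.io.retryWait", "5s")  // wait between retries (default 5s)
>>     .getOrCreate()
>>
>> With these values an executor keeps retrying a failed fetch for roughly
>> maxRetries x retryWait, i.e. about 15 seconds, which is the window in
>> which a quickly rebooted node can come back and serve the data again.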
>>

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built per individual
version, yet for each individual version we depend on other released
versions. This does not seem right to me. Also, the functionality is only
in the PySpark API doc, which does not seem consistent either.

I don't think this is manageable with the current approach (listing
versions in a version-dependent doc). Let's say we release 3.4.3 after
3.5.1. Should we update 3.5.1 to add 3.4.3 to the version switcher? What
about when we release a new version after another ten releases? What are
the criteria for pruning versions?

Unless we have a good answer to these questions, I think it's better to
revert the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
wrote:

> Thanks for reporting - this is odd - the dropdown did not exist in other
> recent releases.
>
> https://spark.apache.org/docs/3.5.0/api/python/index.html
> https://spark.apache.org/docs/3.4.2/api/python/index.html
> https://spark.apache.org/docs/3.3.4/api/python/index.html
>
> Looks like the dropdown feature was recently introduced but only
> partially done. The dropdown itself was added, but documenting how to bump
> the version was missed.
> The contributor proposed a way to update the version "automatically", but
> the PR wasn't merged. As a result, we neither have instructions on how to
> bump the version manually nor an automatic bump.
>
> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
> * PR for automatically bumping version:
> https://github.com/apache/spark/pull/42881
>
> We will probably need to add an instruction in the release process to
> update the version. (For automatic bumping I don't have a good idea.)
> I'll look into it. Please expect some delay during the holiday weekend
> in S. Korea.
>
> Thanks again.
> Jungtaek Lim (HeartSaVioR)
>
>
> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
> wrote:
>
>> BTW, Jungtaek.
>>
>> The PySpark documentation seems to show the wrong branch - at this time, `master`.
>>
>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>
>> PySpark Overview
>> 
>>
>>Date: Feb 24, 2024 Version: master
>>
>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>
>>
>> Could you do the follow-up, please?
>>
>> Thank you in advance.
>>
>> Dongjoon.
>>
>>
>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>
>>> Excellent work, congratulations!
>>>
>>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>>> wrote:
>>>
 Congratulations!

 Bests,
 Dongjoon.

 On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:

> Congratulations!
>
>
>
> At 2024-02-28 17:43:25, "Jungtaek Lim" 
> wrote:
>
> Hi everyone,
>
> We are happy to announce the availability of Spark 3.5.1!
>
> Spark 3.5.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.5 maintenance branch of Spark. We
> strongly
> recommend all 3.5 users to upgrade to this stable release.
>
> To download Spark 3.5.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-5-1.html
>
> We would like to acknowledge all community members for contributing to
> this
> release. This release would not have been possible without you.
>
> Jungtaek Lim
>
> ps. Yikun is helping us with releasing the official Docker image for
> Spark 3.5.1 (thanks Yikun!). It may take some time to be generally
> available.
>
>
>>>
>>> --
>>> John Zhuge
>>>
>>