Just a quick note on working at the RDD level in Spark: once you go down to that level, it's entirely up to you to handle everything. You gain more control and flexibility, but Spark steps back and hands you the steering wheel. If tasks fail, it's usually because you're allowing them to fail, by not capturing and handling every error that can occur inside your function (something like the sketch below).
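
To make that concrete, here is a minimal sketch in Scala. Everything except foreachPartition itself is an invented placeholder: ds is a Dataset[Event], and newHttpClient, postRow (returning the HTTP status code), handleRejected and handleError are stand-ins for whatever client and handlers you use.

    import scala.util.{Failure, Success, Try}

    ds.foreachPartition { (rows: Iterator[Event]) =>
      val client = newHttpClient() // one client per partition, reused across rows
      try {
        rows.foreach { row =>
          Try(postRow(client, row)) match {
            case Success(code) if code / 100 == 2 => ()                        // 2xx: nothing to do
            case Success(code)                    => handleRejected(row, code) // non-2xx: log / dead-letter
            case Failure(e)                       => handleError(row, e)       // timeout, reset, DNS, ...
          }
        }
      } finally {
        client.close() // release the client even if a handler throws
      }
    }

The idea is simply that nothing thrown by the HTTP call escapes the lambda, so the task never fails and Spark never reschedules it.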
Last year, I worked on a Spark Streaming project that had to interact with various external systems (OpenSearch, Salesforce, Microsoft Office 365, a ticketing service, ...) through HTTP requests. We handled those interactions the way I described, and it has been running in production without issues since last summer.

That said, a word of caution: while we didn't face any issues, some colleagues on a different project ran into a problem when calling a model inside a UDF. In certain cases, especially after using the built-in explode function, Spark ended up calling the model twice per row. Their workaround was to cache the DataFrame before invoking the model (sketched at the bottom of this mail). I don't think Spark speculation was enabled in their case, by the way. I still need to dig deeper into the root cause, but just a heads-up: under certain conditions, Spark might trigger multiple executions. We also used the explode function in our project but didn't observe any duplicate calls... so for now, it remains a bit of a mystery.

On Thu, Apr 17, 2025 at 2:02 AM, daniel williams (<daniel.willi...@gmail.com>) wrote:

> Apologies. I was imagining a pure Kafka application with a transactional
> consumer doing the processing. Ángel's suggestion is the correct one:
> use mapPartitions to maintain state across your broadcast Kafka
> producers, and retry or introduce back pressure in the producer's thread
> pool, relying on the producer's built-in backoff. That would allow you to
> retry n times and then, upon final (hopefully not) failure, update your
> dataset for further processing.
>
> -dan
>
>
> On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla
> <abhisheksingla...@gmail.com> wrote:
>
>> @daniel williams <daniel.willi...@gmail.com>
>>
>> > Operate your producer in transactional mode.
>> Could you elaborate on this? How would I achieve it with
>> foreachPartition and an HTTP client?
>>
>> > Checkpointing is an abstract concept only applicable to streaming.
>> That means checkpointing is not supported in batch mode, right?
>>
>> @angel.alvarez.pas...@gmail.com <angel.alvarez.pas...@gmail.com>
>>
>> > What about returning the HTTP codes as a dataframe result in the
>> foreachPartition
>> I believe you are referring to mapPartitions; that could be done. The
>> main concern here is that the issue is not only failures in the
>> third-party system but also task/job failures. I wanted to know if there
>> is an existing way in Spark batch to checkpoint the already processed
>> rows of a partition when using foreachPartition or mapPartitions, so
>> that they are not processed again when a task is rescheduled after a
>> failure or the job is retriggered.
>>
>> Regards,
>> Abhishek Singla
>>
>> On Wed, Apr 16, 2025 at 7:38 PM Ángel Álvarez Pascua
>> <angel.alvarez.pas...@gmail.com> wrote:
>>
>>> What about returning the HTTP codes as a dataframe result in the
>>> foreachPartition, saving it to files/a table/whatever, and then
>>> performing a join to discard the rows that were already processed OK
>>> when you try again?
>>>
>>> On Wed, Apr 16, 2025 at 3:01 PM, Abhishek Singla
>>> (<abhisheksingla...@gmail.com>) wrote:
>>>
>>>> Hi Team,
>>>>
>>>> We are using foreachPartition to send dataset rows to a third-party
>>>> system via an HTTP client. The operation is not idempotent, so I want
>>>> to ensure that in case of failures the already processed rows do not
>>>> get processed again.
>>>>
>>>> Is there a way in Spark batch to
>>>> 1. checkpoint processed partitions, so that if there are 1000
>>>> partitions and 100 were processed in the previous run, they do not get
>>>> processed again?
>>>> 2. checkpoint a partial partition in foreachPartition? If I have
>>>> processed 100 records from a partition that has 1000 records in total,
>>>> is there a way to checkpoint the offset so that those 100 do not get
>>>> processed again?
>>>>
>>>> Regards,
>>>> Abhishek Singla
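
P.S. The cache workaround I mentioned above looked roughly like this. This is my reconstruction, not their actual code; df, the items column and modelUdf are invented names.

    import org.apache.spark.sql.functions.{col, explode}

    val exploded = df.withColumn("item", explode(col("items")))
    exploded.cache() // materialize once so the plan feeding the UDF is not re-evaluated
    exploded.count() // force the cache to fill before scoring

    val scored = exploded.withColumn("score", modelUdf(col("item")))

Without the cache, the plan that feeds modelUdf was apparently executed more than once, hence the duplicate model calls.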
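
And for anyone landing on this thread later, here is a rough sketch of the "return the status codes, save them, and join" idea Ángel suggests above, with Daniel's retry idea folded into the HTTP call. It assumes a SparkSession in scope as spark and that every row carries a unique id; Record, newHttpClient, postWithRetry (n attempts with backoff) and the output path are all hypothetical.

    import spark.implicits._ // assuming spark is your SparkSession

    case class Attempt(id: String, status: Int)

    val results = ds.mapPartitions { (rows: Iterator[Record]) =>
      val client = newHttpClient() // client cleanup elided for brevity
      rows.map { row =>
        val status = postWithRetry(client, row, maxAttempts = 3) // retries with backoff
        Attempt(row.id, status)
      }
    }

    // Persist the outcome of this run...
    results.write.mode("append").parquet("/tmp/http_attempts")

    // ...and when the job is retriggered, skip the rows that already succeeded:
    val done = spark.read.parquet("/tmp/http_attempts")
      .filter($"status" === 200)
      .select("id")
    val remaining = ds.join(done, Seq("id"), "left_anti")

It doesn't give you partial-partition checkpointing inside a single task (a task failure reruns the whole partition), but across job retriggers it gets you the "don't process again" behaviour in plain batch.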