Hello,

In my organization, we have an accounting system for Spark jobs that uses
task execution time to determine how long a job occupies the executors,
and we use that to apportion cost. We sum all the task times per job and
apply proportions. Our clusters follow a 1-task-per-core model, and this
works well.
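
For context, a rough sketch of that accounting, assuming a driver-side
SparkListener and taskInfo.duration as the per-task time (the class and
map names below are just illustrative, not an existing API):

    import scala.collection.mutable
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd}

    // Illustrative only: accumulate total task time (ms) per Spark job id.
    class TaskTimeAccountingListener extends SparkListener {
      private val stageToJob = mutable.Map[Int, Int]()       // stageId -> jobId
      private val taskTimeByJob = mutable.Map[Int, Long]()   // jobId -> summed task time in ms

      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        jobStart.stageIds.foreach(stageId => stageToJob(stageId) = jobStart.jobId)

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        stageToJob.get(taskEnd.stageId).foreach { jobId =>
          // duration = finishTime - launchTime; with 1 task per core this is core-occupancy time
          taskTimeByJob(jobId) = taskTimeByJob.getOrElse(jobId, 0L) + taskEnd.taskInfo.duration
        }

      def totalTaskTime(jobId: Int): Long = taskTimeByJob.getOrElse(jobId, 0L)
    }

The listener is registered with spark.sparkContext.addSparkListener(...),
and the per-job totals feed the proportional cost split described above.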

A job can go through several failures during its run, due to executor
failures or node failures (spot interruptions), and Spark retries tasks
and sometimes entire stages.

We now want to account for these failures and determine what percentage of
a job's total task time is due to retries. Basically, if a job with
failures and retries has a total task time of X, there is an X'
representing the goodput of that job, i.e. the task time of a hypothetical
run with zero failures and retries. In that case, (X - X') / X quantifies
the cost of failures.
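
To make X' concrete, the estimate we have in mind (our assumption, not
something Spark provides) is to take, for each logical task, only the
duration of the attempt that eventually succeeded, and treat everything
else as retry cost:

    // history: for each logical task, all of its attempts, with `successful`
    // marking the attempt that actually produced the result.
    final case class Attempt(durationMs: Long, successful: Boolean)

    def failureCost(history: Seq[Seq[Attempt]]): Double = {
      val x = history.flatten.map(_.durationMs).sum.toDouble                              // X: all attempts
      val xPrime = history.map(_.filter(_.successful).map(_.durationMs).sum).sum.toDouble // X': successful attempts only
      if (x == 0) 0.0 else (x - xPrime) / x
    }

For example, a logical task with one 10-minute failed attempt and one
10-minute successful retry contributes 20 minutes to X and 10 minutes to
X', so half of that task's time is attributed to failures.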

This form of accounting requires tracking the execution history of each
logical task, i.e. all task attempts that compute the same logical
partition of some RDD. This was quite easy with AQE disabled, since stage
IDs never changed, but with AQE enabled that is no longer the case.
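
For reference, the keying we relied on with AQE disabled looked roughly
like the sketch below: group task-end events by (stageId, taskInfo.index)
as the logical-partition identity, and use stageAttemptId /
taskInfo.attemptNumber to tell retries apart. It is this (stageId, index)
key that stops being stable once AQE re-plans stages.

    import scala.collection.mutable
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Illustrative only: track every attempt per logical partition.
    // Key = (stageId, partition index); breaks when AQE changes stage IDs/boundaries.
    class AttemptHistoryListener extends SparkListener {
      final case class AttemptRecord(stageAttemptId: Int, taskAttemptNumber: Int,
                                     durationMs: Long, successful: Boolean)

      private val history = mutable.Map[(Int, Int), mutable.Buffer[AttemptRecord]]()

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val info = taskEnd.taskInfo
        val key = (taskEnd.stageId, info.index)
        val rec = AttemptRecord(taskEnd.stageAttemptId, info.attemptNumber,
                                info.duration, info.successful)
        history.getOrElseUpdate(key, mutable.Buffer.empty) += rec
      }
    }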

Do you have any suggestions on how I can use the Spark event system to
track this reliably when AQE is enabled?

Thanks
Faiz
