Hello,

We run our Spark workloads on spot instances and would like to quantify the
impact of spot interruptions on those workloads. We are proposing the
following metric and would appreciate your opinions on it.

We are leveraging Spark's event listener (SparkListener) and computing the
following:

T = task

T1 = sum(T.execution-time) for all T where T.status = failed and
T.stage-attempt-number = 0

T2 = sum(T.execution-time) for all T where T.stage-attempt-number > 0

Tall = sum(T.execution-time) for all T

Retry% = (T1 + T2) / Tall

The assumptions are:

T1 – if a stage is executing for the first time (attempt 0), only the tasks
that failed were wasted work
T2 – every task executed for a stage with stage-attempt-number > 0 is a
retry, since the stage already executed previously and a re-attempt is
recomputation
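For concreteness, a rough sketch of the listener we have in mind, in Scala
(assumptions on our part: we read T.execution-time as TaskInfo.duration,
i.e. launch-to-finish time in ms, T.status = failed as taskInfo.failed, and
the stage attempt number as stageAttemptId on the task-end event; the class
name RetryWasteListener is just illustrative):

    import java.util.concurrent.atomic.LongAdder
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Buckets task time into the three sums defined above. We treat
    // T.execution-time as TaskInfo.duration (launch to finish, in ms);
    // taskMetrics.executorRunTime would be an alternative if only
    // on-executor time should count.
    class RetryWasteListener extends SparkListener {
      private val t1   = new LongAdder // failed tasks on stage attempt 0
      private val t2   = new LongAdder // all tasks on stage attempts > 0
      private val tAll = new LongAdder // every task, any outcome

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val d = taskEnd.taskInfo.duration
        tAll.add(d)
        if (taskEnd.stageAttemptId > 0) t2.add(d)        // T2 bucket
        else if (taskEnd.taskInfo.failed) t1.add(d)      // T1 bucket
      }

      // Retry% = (T1 + T2) / Tall
      def retryFraction: Double = {
        val total = tAll.sum.toDouble
        if (total == 0) 0.0 else (t1.sum + t2.sum) / total
      }
    }

The listener can be registered with
spark.sparkContext.addSparkListener(new RetryWasteListener) or via the
spark.extraListeners config.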
