Hello, we run our Spark workloads on spot instances and would like to quantify the impact of spot interruptions on them. We are proposing the following metric, but would like your opinions on it.
We are leveraging Spark's event listener and computing the following, where T ranges over tasks:

T1 = sum(T.execution-time) for all T where T.status = failed and T.stage-attempt-number = 0
T2 = sum(T.execution-time) for all T where T.stage-attempt-number > 0
Tall = sum(T.execution-time) for all T
Retry% = (T1 + T2) / Tall

The assumptions are:
T1 – if a stage is executing for the first time, only the tasks that failed were wasted work
T2 – every task executed for a stage with stage-attempt-number > 0 is a retry, since the stage was already executed previously
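As a sketch of just the arithmetic (not the listener wiring), assuming each task record carries an execution time, a status, and the stage attempt number — hypothetical field names, not Spark's actual event schema — the metric could be computed like this:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    # Hypothetical record assembled from task-end events; field names
    # are illustrative, not Spark's actual listener API.
    execution_time: float      # task run time in seconds
    status: str                # "success" or "failed"
    stage_attempt_number: int  # 0 for the first attempt of the stage

def retry_pct(tasks):
    # T1: failed tasks on a stage's first attempt are wasted work.
    t1 = sum(t.execution_time for t in tasks
             if t.status == "failed" and t.stage_attempt_number == 0)
    # T2: any task on a re-attempted stage is a retry.
    t2 = sum(t.execution_time for t in tasks if t.stage_attempt_number > 0)
    t_all = sum(t.execution_time for t in tasks)
    return (t1 + t2) / t_all if t_all else 0.0

tasks = [
    TaskRecord(10.0, "success", 0),
    TaskRecord(4.0, "failed", 0),   # counts toward T1
    TaskRecord(6.0, "success", 1),  # counts toward T2
]
print(retry_pct(tasks))  # (4 + 6) / 20 = 0.5
```

The guard on an empty task list avoids a division by zero for jobs that emit no task events.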