I'm sitting here looking at my application crunching gigabytes of data on a cluster and I have no idea whether it's an hour away from completion or a minute. The web UI shows progress through each stage, but not how many stages remain. How can I work out automatically how many stages my program will take?
My application has a slightly interesting DAG (re-use of functions that contain Spark transformations, persisted RDDs). Not that complex, but not 'step 1, step 2, step 3' either.

I'm guessing that since the driver program runs sequentially, sending commands to Spark as it goes, Spark has no knowledge of the overall structure of the driver program. Is it therefore necessary to execute it on a small test dataset and count the stages that result?

When I set `spark.eventLog.enabled = true` and run on (very small) test data, I don't get any stage messages in STDOUT or in the log file. This is on a `local` instance. Did I miss something obvious?

Thanks!
Joe
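For reference, here's how I'm enabling the event log (the directory is just where I happened to point it; I believe it defaults to `/tmp/spark-events` if unset):

```properties
# spark-defaults.conf (or passed via --conf on spark-submit)
spark.eventLog.enabled  true
spark.eventLog.dir      file:///tmp/spark-events
```

In case I've mangled the configuration, corrections welcome.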