Hi everybody, this is my first message to the mailing list. In a world of DataFrames and Structured Streaming my use cases may be considered corner cases, but I still think it's important to address such problems and dig deep into understanding how Spark RDDs work.
We have an application that synthesizes complex Spark jobs using the RDD interface, easily reaching hundreds of RDDs and stages for each single job. Due to the nature of the functions passed to mapPartitions & co., we need precise control over the characteristics of the job Spark constructs when an action is invoked on the last RDD: what is shuffled and what is not, the number of stages, fine tuning of RDD persistence, the partitioning, and so on. For example, we have unit tests that make sure the number of stages composing the job synthesized for a given use case doesn't increase, in order to detect unforeseen shuffles that we would consider a regression.

The problem is that it's difficult to "introspect" the job produced by Spark. Even counting the number of stages is complex: there isn't any simple stageId associated with each RDD. Today we do it indirectly, by registering a listener and monitoring the events emitted when the action is invoked, but I would prefer a more direct way, like checking the properties of an RDD.

So another approach could be to analyze, recursively, the dependencies of the RDD on which the action is invoked, and count the dependencies that are a ShuffleDependency, making sure each parent RDD is considered only once. Does that make sense? Is it a reliable method?

Regards,
Stefano
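P.S. To make the dependency-walk idea concrete, here is a rough, untested sketch of the recursive count I have in mind. It only assumes the public `RDD.dependencies` and `Dependency.rdd` API; `countShuffleDeps` is just a name I made up:

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Count the ShuffleDependencies reachable from `rdd`, visiting each
// RDD in the lineage graph only once (by RDD id), so diamond-shaped
// lineages don't count the same shuffle twice.
def countShuffleDeps(rdd: RDD[_]): Int = {
  val visited = mutable.Set[Int]()
  def visit(r: RDD[_]): Int = {
    if (!visited.add(r.id)) 0 // already seen this RDD, skip it
    else r.dependencies.map { dep =>
      val isShuffle = dep.isInstanceOf[ShuffleDependency[_, _, _]]
      (if (isShuffle) 1 else 0) + visit(dep.rdd)
    }.sum
  }
  visit(rdd)
}
```

If this count is reliable, I would expect the number of stages to be this value plus one (for the final ResultStage), at least in the simple case where no stages are skipped because their shuffle output is already available.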
