Something like this, maybe?
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.execution.LogicalRDD

val df: DataFrame = ???
val spark = df.sparkSession

// Hook into the physical execution at the InternalRow level.
val rddOfInternalRows = df.queryExecution.toRdd.mapPartitions { iter =>
  log.info("Test")  // assumes a logger (e.g. slf4j/log4j) is in scope
  iter
}

// Rebuild attributes from the schema so we can wrap the RDD in a plan again.
val attributes = df.schema.map(f =>
  AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
)

// Re-wrap the instrumented RDD as a DataFrame with the original schema.
val logicalPlan = LogicalRDD(attributes, rddOfInternalRows)(spark)
val rowEncoder = RowEncoder(df.schema)
val resultingDataFrame = new Dataset[Row](spark, logicalPlan, rowEncoder)
resultingDataFrame

On Mon, Aug 14, 2017 at 2:15 PM, Lukas Bradley <lukasbrad...@gmail.com> wrote:

> We have had issues with gathering status on long-running jobs. We have
> attempted to draw parallels between the Spark UI/Monitoring API and our
> code base. Due to the separation between the code and the execution plan,
> even guessing where we are in the process is difficult. The
> Job/Stage/Task information is too abstracted from our code to be easily
> digested by the non-Spark engineers on our team.
>
> Is there a "hook" to which I can attach a piece of code that is triggered
> when a point in the plan is reached? This could be when a SQL command
> completes, or when a new Dataset is created, anything really...
>
> It seems Dataset.checkpoint() offers an excellent snapshot position during
> execution, but I'm concerned about short-circuiting the optimal execution
> of the full plan. I really want these trigger functions to be completely
> independent of the actual processing itself. I'm not looking to extract
> information from a Dataset, RDD, or anything else. I essentially want to
> write independent output for status.
>
> If this doesn't exist, is there any desire on the dev team for me to
> investigate this feature?
>
> Thank you for any and all help.
>
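For coarser-grained status (per job/stage rather than per partition), Spark's listener API also gives hooks that fire independently of the query plan and of any Dataset contents. A minimal sketch; the listener value name and log messages are mine, and `spark` is the SparkSession from the snippet above:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

// Fires on scheduler events, completely outside the processing itself.
val statusListener = new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"stage ${stage.stageInfo.stageId} completed: ${stage.stageInfo.name}")

  override def onJobEnd(job: SparkListenerJobEnd): Unit =
    println(s"job ${job.jobId} finished with result ${job.jobResult}")
}

spark.sparkContext.addSparkListener(statusListener)
```

The limitation is the one Lukas describes: events arrive at job/stage/task granularity, so mapping them back to application-level steps still takes some bookkeeping on your side.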