[ https://issues.apache.org/jira/browse/SPARK-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026989#comment-14026989 ]
Cheng Lian commented on SPARK-2094: ----------------------------------- The basic idea is illustrated in {{CacheCommandPhysical}}: https://github.com/apache/spark/pull/1038/files#diff-726d84ece1e6f6197b98a5868c881ac7R68 > Ensure exactly once semantics for DDL / Commands > ------------------------------------------------ > > Key: SPARK-2094 > URL: https://issues.apache.org/jira/browse/SPARK-2094 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Michael Armbrust > Fix For: 1.1.0 > > > From [~lian cheng]... > The constraints presented here are: > * The side effect of a command SchemaRDD should take place eagerly; > * The side effect of a command SchemaRDD should take place once and only > once; > * When .collect() method is called, something meaningful, usually the output > message lines of the command, should be presented. > Then how about adding a lazy field inside all the physical command nodes to > wrap up the side effect and hold the command output? Take the > SetCommandPhysical as an example: > {code} > trait PhysicalCommand(@transient context: SQLContext) { > lazy val commandOutput: Any > } > case class SetCommandPhysical( > key: Option[String], value: Option[String], output: Seq[Attribute])( > @transient context: SQLContext) > extends PhysicalCommand(context) > with PhysicalCommand { > override lazy val commandOutput = { > // Perform the side effect, and record appropriate output > ??? > } > def execute(): RDD[Row] = { > val row = new GenericRow(Array[Any](commandOutput)) > context.sparkContext.parallelize(row, 1) > } > } > {code} > In this way, all the constraints are met: > * Eager evaluation: done by the toRdd call in SchemaRDDLike (PR #948), > * Side effect should take place once and only once: ensured by the lazy > commandOutput field, > * Present meaningful output as RDD contents: command output is held by > commandOutput and returned in execute(). > An additional benefit is that, side effect logic of all the commands can be > implemented within their own physical command nodes, instead of adding > special cases inside SQLContext.toRdd and/or HiveContext.toRdd. -- This message was sent by Atlassian JIRA (v6.2#6252)