Input datasets which represent a input data stream only supports appending of new rows, as the stream is modeled as an unbounded table where new data in the stream are new rows being appended to the table. For transformed datasets generated from the input dataset, rows can be updated and removed as input dataset has added rows. To take a concrete example, if you are maintaining a running word count dataset, every time the input dataset has new rows, the rows of the word count dataset will get updated. Where this really matters is when the transformed data needs written to a output sink and that's where the output modes decide how the updated/deleted rows are written to the sink. Currently, Spark 2.0 will support only the Complete Mode, where after any update, ALL the rows (i.e. added, updated, and unchanged rows) of the word count dataset will be given to the sink for output. Future version of Spark will have the Update mode, where only the added/updated rows will be given to the sink.
On Mon, Jul 4, 2016 at 8:23 AM, Arnaud Bailly <arnaud.oq...@gmail.com> wrote: > Hello, > > I am interested in using the new Structured Streaming feature of Spark SQL > and am currently doing some experiments on code at HEAD. I would like to > have a better understanding of how deletion should be handled in a > structured streaming setting. > > Given some incremental query computing an arbitrary aggregation over some > dataset, inserting new values is somewhat obvious: simply update the > aggregate computation tree with whatever new values are added to input > datasets/datastreams. But things are not so obvious for updates and > deletions: do they have a representation in the input datastreams? If I > have a query that aggregates some value over some key, and I delete all > instances of that key, I would expect the query to output a result removing > the key's aggregated value. The same is true for updates... > > Thanks for any insights you might want to share. > > Regards, > -- > Arnaud Bailly > > twitter: abailly > skype: arnaud-bailly > linkedin: http://fr.linkedin.com/in/arnaudbailly/ >