Thank you, TD. A couple of follow-up questions, please.

1) "It only keeps around the minimal intermediate state data"
How do you define "minimal" here? Is there a configuration property to
control the time or size of the streaming DataFrame?

2) I'm not writing anything out to any database or S3. My requirement is to
compute a real-time count over a 1-hour window and read that count from a
BI tool. Can I register the result as a temp view and access it from the BI
tool?

I tried something like this in my streaming application:

    AggStreamingDF.createOrReplaceGlobalTempView("streaming_table")

Then, in the BI tool, I queried it like this:

    select * from streaming_table

Error: Queries with streaming sources must be executed with
writeStream.start()

Any suggestions to make this work? (I've put a rough sketch of what I'm
considering below the quoted thread.)

Thank you very much for your help!

On Tue, Aug 27, 2019, 6:42 PM Tathagata Das <tathagata.das1...@gmail.com> wrote:

> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
>
>> *Note that Structured Streaming does not materialize the entire table*.
>> It reads the latest available data from the streaming data source,
>> processes it incrementally to update the result, and then discards the
>> source data. It only keeps around the minimal intermediate *state* data
>> as required to update the result (e.g. intermediate counts in the earlier
>> example).
>
> On Tue, Aug 27, 2019 at 1:21 PM Nick Dawes <nickdawe...@gmail.com> wrote:
>
>> I have a quick newbie question.
>>
>> Spark Structured Streaming creates an unbounded dataframe that keeps
>> appending rows to it.
>>
>> So what's the max size of data it can hold? What if the size becomes
>> bigger than the JVM? Will it spill to disk? I'm using S3 as storage. So
>> will it write temp data on S3 or on local file system of the cluster?
>>
>> Nick
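P.S. From the programming guide, the "memory" sink looks like it might cover
this, though I haven't verified it end to end. A rough sketch in Scala,
assuming AggStreamingDF is the windowed-count DataFrame from above (the
one-minute trigger interval is just a placeholder):

    import org.apache.spark.sql.streaming.Trigger

    // Start the streaming query with the in-memory sink. "complete" mode
    // keeps the full, continuously updated aggregate result table.
    val query = AggStreamingDF.writeStream
      .outputMode("complete")
      .format("memory")                // results land in an in-memory table
      .queryName("streaming_table")    // temp view name to query against
      .trigger(Trigger.ProcessingTime("1 minute"))  // placeholder interval
      .start()

    // From the same Spark session, the aggregate is now a plain temp view:
    spark.sql("select * from streaming_table").show()

If I understand correctly, that table only lives in the driver's session, so
the BI tool would have to connect through something running on that same
session (e.g. the Spark Thrift server). Does that sound right?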