Thank you, TD. A couple of follow-up questions, please.

1) "It only keeps around the minimal intermediate state data"
How do you define "minimal" here? Is there a configuration property to
control the time or size of the streaming DataFrame?

2) I'm not writing anything out to any database or S3. My requirement is to
compute a real-time count over a 1-hour window and read that count from a
BI tool. Can I register the result as a temp view and access it from the BI
tool?

I tried something like this in my streaming application:

    AggStreamingDF.createOrReplaceGlobalTempView("streaming_table")

Then, in the BI tool, I queried it like this:

    select * from streaming_table

Error: Queries with streaming sources must be executed with
writeStream.start()

Any suggestions to make this work? (I've put a rough sketch of what I'm
considering below the quoted thread.)

Thank you very much for your help!

On Tue, Aug 27, 2019, 6:42 PM Tathagata Das <tathagata.das1...@gmail.com> wrote:

> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts
>
>> *Note that Structured Streaming does not materialize the entire table*.
>> It reads the latest available data from the streaming data source,
>> processes it incrementally to update the result, and then discards the
>> source data. It only keeps around the minimal intermediate *state* data
>> as required to update the result (e.g. intermediate counts in the earlier
>> example).
>
> On Tue, Aug 27, 2019 at 1:21 PM Nick Dawes <nickdawe...@gmail.com> wrote:
>
>> I have a quick newbie question.
>>
>> Spark Structured Streaming creates an unbounded dataframe that keeps
>> appending rows to it.
>>
>> So what's the max size of data it can hold? What if the size becomes
>> bigger than the JVM? Will it spill to disk? I'm using S3 as storage. So
>> will it write temp data on S3 or on local file system of the cluster?
>>
>> Nick
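P.S. From the programming guide, the "memory" sink looks like it might cover
this, though I haven't verified it end to end. A rough sketch in Scala,
assuming AggStreamingDF is the windowed-count DataFrame from above (the
one-minute trigger interval is just a placeholder):

    import org.apache.spark.sql.streaming.Trigger

    // Start the streaming query with the in-memory sink. "complete" mode
    // keeps the full, continuously updated aggregate result table.
    val query = AggStreamingDF.writeStream
      .outputMode("complete")
      .format("memory")                // results land in an in-memory table
      .queryName("streaming_table")    // temp view name to query against
      .trigger(Trigger.ProcessingTime("1 minute"))  // placeholder interval
      .start()

    // From the same Spark session, the aggregate is now a plain temp view:
    spark.sql("select * from streaming_table").show()

If I understand correctly, that table only lives in the driver's session, so
the BI tool would have to connect through something running on that same
session (e.g. the Spark Thrift server). Does that sound right?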