[Structured Streaming]: Are Spark Dataframes mutable in Structured Streaming?

Sheel Pancholi Thu, 16 May 2019 02:56:43 -0700

Tagging mail to hopefully get a quicker response

On Thu 16 May, 2019, 3:08 PM Sheel Pancholi, <sheelst...@gmail.com> wrote:


> Hello,
>
> Along with what I sent before, I want to add that I went over the
> documentation at
> https://github.com/apache/spark/blob/master/docs/structured-streaming-programming-guide.md
>
>
> Here is an excerpt:
>
> [image: Model]
>> <https://github.com/apache/spark/blob/master/docs/img/structured-streaming-example-model.png>
>>
>> Note that Structured Streaming does not materialize the entire table. It
>> reads the latest available data from the streaming data source, processes
>> it incrementally to update the result, and then discards the source data.
>> It only keeps around the minimal intermediate *state* data as required
>> to update the result (e.g. intermediate counts in the earlier example).
>>
> My question is: on one hand, the diagram shows the input table to truly be
> unbounded by constantly letting the data arrive into this "table". But, on
> the other hand, it also says, that it discards the source data. Then, what
> is the meaning of the unbounded table in the diagram above showing
> incremental data arriving and sitting in this unbounded input table!
> Moreover, it also says that it keeps the intermediate data only (i.e. the
> intermediate counts). This is kind of sounding contradictory in my head.
>
> Could you please clarify what is it ultimately supposed to be?
>
> Regards
> Sheel
>
> On Thu, May 16, 2019 at 2:44 PM Sheel Pancholi <sheelst...@gmail.com>
> wrote:
>
>> Hello Russell,
>>
>> Thanks for clarifying. I went over the Catalyst Optimizer Deep Dive video
>> at https://www.youtube.com/watch?v=RmUn5vHlevc and that along with your
>> explanation made me realize that the the DataFrame is the new DStream in
>> Structured Streaming. If my understanding is correct, request you to
>> clarify the 2 points below:
>>
>> 1. Incremental Query - Say at time instant T1 you have 10 items to
>> process, and then at time instant T2 you have 5 newer items streaming in to
>> be processed. *Structured Streaming* says that the DF is treated as an
>> unbounded table and hence 15 items will be processed together. Does this
>> mean that on Iteration 1 (i.e. time instant T1) the Catalyst Optimizer in
>> the code generation phase creates an RDD of 10 elements and on Iteration 2
>> ( i.e. time instant T2 ), the Catalyst Optimizer creates an RDD of 15
>> elements?
>> 2. You mentioned *"Some parts of the plan refer to static pieces of
>> data  ..."*  Could you elaborate a bit more on what does this static
>> piece of data refer to? Are you referring to the 10 records that had
>> already arrived at T1 and are now sitting as old static data in the
>> unbounded table?
>>
>> Regards
>> Sheel
>>
>>
>> On Thu, May 16, 2019 at 3:30 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> Dataframes describe the calculation to be done, but the underlying
>>> implementation is an "Incremental Query". That is that the dataframe code
>>> is executed repeatedly with Catalyst adjusting the final execution plan on
>>> each run. Some parts of the plan refer to static pieces of data, others
>>> refer to data which is pulled in on each iteration. None of this changes
>>> the DataFrame objects themselves.
>>>
>>>
>>>
>>>
>>> On Wed, May 15, 2019 at 1:34 PM Sheel Pancholi <sheelst...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>> Structured Streaming treats a stream as an unbounded table in the form
>>>> of a DataFrame. Continuously flowing data from the stream keeps getting
>>>> added to this DataFrame (which is the unbounded table) which warrants a
>>>> change to the DataFrame which violates the vary basic nature of a DataFrame
>>>> since a DataFrame by its nature is immutable. This sounds contradictory. Is
>>>> there an explanation for this?
>>>>
>>>> Regards
>>>> Sheel
>>>>
>>>
>>
>> --
>>
>> Best Regards,
>>
>> Sheel Pancholi
>>
>> *Mob: +91 9620474620*
>>
>> *Connect with me on: Twitter <https://twitter.com/sheelstera> | LinkedIn
>> <http://in.linkedin.com/in/sheelstera>*
>>
>> *Write to me at:Sheel@Yahoo!! <sheelpanch...@yahoo.com> | Sheel@Gmail!!
>> <sheelst...@gmail.com> | Sheel@Windows Live <sheelst...@live.com>*
>>
>> P *Save a tree* - please do not print this email unless you really need
>> to!
>>
>
>
> --
>
> Best Regards,
>
> Sheel Pancholi
>
> *Mob: +91 9620474620*
>
> *Connect with me on: Twitter <https://twitter.com/sheelstera> | LinkedIn
> <http://in.linkedin.com/in/sheelstera>*
>
> *Write to me at:Sheel@Yahoo!! <sheelpanch...@yahoo.com> | Sheel@Gmail!!
> <sheelst...@gmail.com> | Sheel@Windows Live <sheelst...@live.com>*
>
> P *Save a tree* - please do not print this email unless you really need
> to!
>

[Structured Streaming]: Are Spark Dataframes mutable in Structured Streaming?

Reply via email to