I actually asked the same thing a couple of weeks ago.
Apparently, when you create a structured streaming plan, it differs from the
batch plan and is fixed so that aggregations work correctly across batches. If
you perform most operations directly on the streaming dataframe, Spark
recomputes the plan as a batch plan and it will therefore not work properly.
So you must either collect, or convert to an RDD and then create a new
dataframe from that RDD.
It would be very useful IMO if we could "freeze" the plan for the input portion
and work with it as if it were a new dataframe (similar to converting to an RDD
and then creating a new dataframe from it, but without the overhead of the
round trip); however, this is not currently possible.
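A minimal sketch of the RDD round trip described above, in the spirit of what ConsoleSink's addBatch does. The helper name `rebatch` and its use here are illustrative, not an exact copy of the Spark source:

```scala
import org.apache.spark.sql.DataFrame

// Rebuild a micro-batch DataFrame as a plain batch DataFrame.
// collect() materializes the batch on the driver; parallelize() +
// createDataFrame() then yields a DataFrame backed by a fresh batch
// plan, detached from the streaming plan.
def rebatch(data: DataFrame): DataFrame = {
  val spark = data.sparkSession
  spark.createDataFrame(
    spark.sparkContext.parallelize(data.collect().toSeq),
    data.schema)
}
```

This is exactly the expensive pattern Jacek asks about below: a driver-side collect plus a re-parallelize, paid solely to get a dataframe whose plan is no longer a streaming plan.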

Thanks,
              Assaf.

From: Jacek Laskowski [via Apache Spark Developers List] 
[mailto:ml+s1001551n21930...@n3.nabble.com]
Sent: Friday, July 07, 2017 11:44 AM
To: Mendelson, Assaf
Subject: [SS] Why does ConsoleSink's addBatch convert input DataFrame to show 
it?

Hi,

Just noticed that the input DataFrame is collect'ed and then
parallelize'd simply to show it to the console [1]. Why so many fairly
expensive operations for show?

I'd appreciate some help understanding this code. Thanks.

[1] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/console.scala#L51-L53

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


