When you operate on a DataFrame from the Python side you are just invoking
methods in the JVM via a proxy (py4j), so it is almost the same as coding in Java
itself. This holds as long as you don't define any UDFs or any other code
that needs to invoke Python for processing.
Check the High Performance Spark book.
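The proxying idea above can be illustrated with a toy sketch. This is only the pattern, not py4j itself (real py4j forwards calls over a socket to a JVM gateway); the class names below are made up for illustration:

```python
# Toy illustration of the proxy pattern py4j uses: the Python-side object
# holds no row data; every method call is forwarded by name to a peer that
# does. Here a plain Python object stands in for the JVM side.

class JvmDataFrame:
    """Stands in for the JVM-side DataFrame that owns the actual rows."""
    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)

    def filter_gt(self, column, value):
        kept = [r for r in self._rows if r[column] > value]
        return JvmDataFrame(kept)


class ProxyDataFrame:
    """Python-side handle: forwards every call, keeps no row data."""
    def __init__(self, jvm_obj):
        self._jvm = jvm_obj

    def __getattr__(self, name):
        remote = getattr(self._jvm, name)

        def call(*args):
            result = remote(*args)
            # Returned "JVM" objects come back wrapped in a new proxy,
            # the way py4j hands you JavaObject wrappers.
            if isinstance(result, JvmDataFrame):
                return ProxyDataFrame(result)
            return result

        return call


df = ProxyDataFrame(JvmDataFrame([{"x": 1}, {"x": 5}, {"x": 9}]))
print(df.filter_gt("x", 2).count())  # only method names and results cross the boundary
```

The point of the sketch: `filter_gt` and `count` execute entirely on the "JVM" side, and only the call itself and the final integer cross over, which is why DataFrame code without Python UDFs stays fast.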
Hi
In PySpark, why do RDDs need to be serialised/deserialised, but DataFrames don't?
Thanks
On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov
wrote:
Your Scala program does not use any Spark API, hence it is faster than the others. If
you write the same code in pure Python I think it will be even faster than the
Scala program, especially taking into account that these two programs run on a
single VM.
Regarding DataFrame and RDD, I would suggest using DataFrames an
Any particular code sample you can suggest to review on your tips?
> On Jan 30, 2022, at 06:16, Sebastian Piu wrote:
It's because all data needs to be pickled back and forth between Java and a
spawned Python worker, so there is additional overhead compared to staying fully
in Scala.
Your Python code might make this worse too, for example if you are not yielding
from operations.
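On the yielding point: a function passed to `RDD.mapPartitions` receives an iterator, and writing it as a generator lets rows stream back as they are produced, whereas returning a full list materializes the whole partition's output in the Python worker first. A minimal Spark-free sketch of the two shapes (the doubling logic is made up for illustration):

```python
# Two ways to write a mapPartitions-style function. Both produce the same
# rows, but the generator streams them one at a time instead of buffering
# the entire output list in the Python worker's memory.

def process_partition_list(rows):
    out = []
    for r in rows:
        out.append(r * 2)   # whole partition buffered before returning
    return out


def process_partition_gen(rows):
    for r in rows:
        yield r * 2         # rows streamed back as they are produced


partition = range(5)
assert list(process_partition_gen(partition)) == process_partition_list(partition)
# With Spark you would pass either one to rdd.mapPartitions(...)
```

Both are valid, but the generator form keeps the worker's memory footprint proportional to one row rather than one partition.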
You can look at using UDFs with Arrow or trying t
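The gain from the Arrow path comes largely from moving data in columnar batches instead of serialising rows one at a time. In Spark the real mechanism is `pyspark.sql.functions.pandas_udf`; the stdlib sketch below only shows the batching idea, comparing one pickle round trip per row (the plain Python UDF shape) against one round trip for the whole batch:

```python
# Why batching serialization helps: same data, same result, but one big
# round trip is much cheaper than ten thousand small ones.
import pickle
import timeit

rows = [(i, float(i), f"name-{i}") for i in range(10_000)]


def per_row():
    # One serialization round trip per row, like a plain Python UDF.
    return [pickle.loads(pickle.dumps(r)) for r in rows]


def batched():
    # One round trip for the whole batch, like the Arrow/pandas UDF path.
    return pickle.loads(pickle.dumps(rows))


assert per_row() == batched() == rows

t_row = timeit.timeit(per_row, number=5)
t_batch = timeit.timeit(batched, number=5)
print(f"per-row: {t_row:.3f}s  batched: {t_batch:.3f}s")
```

Arrow goes further than this sketch by using a columnar zero-copy format rather than pickle, but the per-call overhead being amortised over a batch is the same idea.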