When you operate on a DataFrame from the Python side you are just invoking
methods in the JVM via a proxy (py4j), so it is almost the same as coding in Java
itself. This holds as long as you don't define any UDFs or any other code
that needs to invoke Python for processing.
Check the High Performance Spark book.
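The proxying idea above can be illustrated with a toy sketch. This is only the pattern, not py4j itself (real py4j forwards calls over a socket to a JVM gateway); the class names below are made up for illustration:

```python
# Toy illustration of the proxy pattern py4j uses: the Python-side object
# holds no row data; every method call is forwarded by name to a peer that
# does. Here a plain Python object stands in for the JVM side.

class JvmDataFrame:
    """Stands in for the JVM-side DataFrame that owns the actual rows."""
    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)

    def filter_gt(self, column, value):
        kept = [r for r in self._rows if r[column] > value]
        return JvmDataFrame(kept)


class ProxyDataFrame:
    """Python-side handle: forwards every call, keeps no row data."""
    def __init__(self, jvm_obj):
        self._jvm = jvm_obj

    def __getattr__(self, name):
        remote = getattr(self._jvm, name)

        def call(*args):
            result = remote(*args)
            # Returned "JVM" objects come back wrapped in a new proxy,
            # the way py4j hands you JavaObject wrappers.
            if isinstance(result, JvmDataFrame):
                return ProxyDataFrame(result)
            return result

        return call


df = ProxyDataFrame(JvmDataFrame([{"x": 1}, {"x": 5}, {"x": 9}]))
print(df.filter_gt("x", 2).count())  # only method names and results cross the boundary
```

The point of the sketch: `filter_gt` and `count` execute entirely on the "JVM" side, and only the call itself and the final integer cross over, which is why DataFrame code without Python UDFs stays fast.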
Hi
In PySpark, why do RDDs need to be serialised/deserialised, but DataFrames don't?
Thanks
On Mon, Jan 31, 2022 at 4:46 PM Khalid Mammadov
wrote:
Your Scala program does not use any Spark API, hence it is faster than the others. If
you write the same code in pure Python I think it will be even faster than the
Scala program, especially taking into account that these two programs run on a
single VM.
Regarding DataFrame and RDD, I would suggest using DataFrames an
Any particular code sample you can suggest to review on your tips?
> On Jan 30, 2022, at 06:16, Sebastian Piu wrote:
It's because all data needs to be pickled back and forth between Java and a
spawned Python worker, so there is additional overhead compared to staying fully
in Scala.
Your Python code might make this worse too, for example if you are not yielding
from operations.
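On the yielding point: a function passed to `RDD.mapPartitions` receives an iterator, and writing it as a generator lets rows stream back as they are produced, whereas returning a full list materializes the whole partition's output in the Python worker first. A minimal Spark-free sketch of the two shapes (the doubling logic is made up for illustration):

```python
# Two ways to write a mapPartitions-style function. Both produce the same
# rows, but the generator streams them one at a time instead of buffering
# the entire output list in the Python worker's memory.

def process_partition_list(rows):
    out = []
    for r in rows:
        out.append(r * 2)   # whole partition buffered before returning
    return out


def process_partition_gen(rows):
    for r in rows:
        yield r * 2         # rows streamed back as they are produced


partition = range(5)
assert list(process_partition_gen(partition)) == process_partition_list(partition)
# With Spark you would pass either one to rdd.mapPartitions(...)
```

Both are valid, but the generator form keeps the worker's memory footprint proportional to one row rather than one partition.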
You can look at using UDFs with Arrow or trying t
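The gain from the Arrow path comes largely from moving data in columnar batches instead of serialising rows one at a time. In Spark the real mechanism is `pyspark.sql.functions.pandas_udf`; the stdlib sketch below only shows the batching idea, comparing one pickle round trip per row (the plain Python UDF shape) against one round trip for the whole batch:

```python
# Why batching serialization helps: same data, same result, but one big
# round trip is much cheaper than ten thousand small ones.
import pickle
import timeit

rows = [(i, float(i), f"name-{i}") for i in range(10_000)]


def per_row():
    # One serialization round trip per row, like a plain Python UDF.
    return [pickle.loads(pickle.dumps(r)) for r in rows]


def batched():
    # One round trip for the whole batch, like the Arrow/pandas UDF path.
    return pickle.loads(pickle.dumps(rows))


assert per_row() == batched() == rows

t_row = timeit.timeit(per_row, number=5)
t_batch = timeit.timeit(batched, number=5)
print(f"per-row: {t_row:.3f}s  batched: {t_batch:.3f}s")
```

Arrow goes further than this sketch by using a columnar zero-copy format rather than pickle, but the per-call overhead being amortised over a batch is the same idea.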