Is there any particular code sample you can suggest that illustrates your tips?
> On Jan 30, 2022, at 06:16, Sebastian Piu <sebastian....@gmail.com> wrote:
>
>
> It's because all data needs to be pickled back and forth between the JVM and a
> spawned Python worker, so there is additional overhead compared to staying
> fully in Scala.
>
> Your Python code might make this worse too, for example if it is not yielding
> results from operations (returning full lists instead of generators)
>
> You can look at using pandas UDFs with Arrow, or try to stay as much as
> possible on DataFrame operations only
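
The yielding tip above can be sketched in plain Python, no Spark required (the function and variable names below are illustrative, not from the original thread). A `mapPartitions`-style function that yields records one at a time lets the worker stream results back instead of materialising the whole partition in memory first:

```python
def count_words_eager(partition):
    # Anti-pattern: builds the entire result list in memory before
    # anything is serialised back to the JVM.
    pairs = []
    for line in partition:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def count_words_lazy(partition):
    # Preferred: a generator that yields pairs as they are produced,
    # so results can be streamed and peak memory stays flat.
    for line in partition:
        for word in line.split():
            yield (word, 1)

# With a real RDD this would be: rdd.mapPartitions(count_words_lazy)
partition = ["spark is fast", "spark is fun"]
assert list(count_words_lazy(partition)) == count_words_eager(partition)
```

Both functions produce the same pairs; the difference only shows up in memory use on large partitions.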
>
>> On Sun, 30 Jan 2022, 10:11 Bitfox, <bit...@bitfox.top> wrote:
>> Hello list,
>>
>> I did a comparison of a pyspark RDD, a scala RDD, a pyspark dataframe and a
>> pure scala program. The results show that the pyspark RDD is far too slow.
>>
>> For the operations and dataset please see:
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>>
>> Thanks a lot.
>>
>>
program            time
-----------------  -----
scala program      49s
pyspark dataframe  56s
scala RDD          1m31s
pyspark RDD        7m15s
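
On the "pickled back and forth" point: with the RDD API every record crosses the JVM/Python boundary as pickled bytes, in both directions, which is a plausible contributor to the 7m15s row above. A rough, Spark-free illustration of that per-partition round trip (record counts here are arbitrary; this is a sketch, not a benchmark):

```python
import pickle

# Simulate one partition of (word, count) records. With a PySpark RDD,
# records are pickled when shipped to the Python worker and pickled
# again when the results go back to the JVM.
records = [("word%d" % (i % 1000), 1) for i in range(50_000)]

blob = pickle.dumps(records)    # JVM -> Python worker
restored = pickle.loads(blob)   # Python worker -> JVM

assert restored == records
```

DataFrame operations sidestep this tax because they compile to JVM code via Catalyst, and Arrow-backed pandas UDFs amortise it over columnar batches instead of paying it per record.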