Is there any particular code sample you can suggest that illustrates your tips?
> On Jan 30, 2022, at 06:16, Sebastian Piu <sebastian....@gmail.com> wrote:
>
>
> It's because all data needs to be pickled back and forth between the JVM and a
> spawned Python worker, so there is additional overhead compared to staying
> fully in Scala.
>
> Your Python code might make this worse too, for example if it is not yielding
> results from operations (returning full lists instead of generators)
>
> You can look at using pandas UDFs with Arrow, or try to stay as much as
> possible on DataFrame operations only
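
The yielding tip above can be sketched in plain Python, no Spark required (the function and variable names below are illustrative, not from the original thread). A `mapPartitions`-style function that yields records one at a time lets the worker stream results back instead of materialising the whole partition in memory first:

```python
def count_words_eager(partition):
    # Anti-pattern: builds the entire result list in memory before
    # anything is serialised back to the JVM.
    pairs = []
    for line in partition:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def count_words_lazy(partition):
    # Preferred: a generator that yields pairs as they are produced,
    # so results can be streamed and peak memory stays flat.
    for line in partition:
        for word in line.split():
            yield (word, 1)

# With a real RDD this would be: rdd.mapPartitions(count_words_lazy)
partition = ["spark is fast", "spark is fun"]
assert list(count_words_lazy(partition)) == count_words_eager(partition)
```

Both functions produce the same pairs; the difference only shows up in memory use on large partitions.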
>
>> On Sun, 30 Jan 2022, 10:11 Bitfox, <bit...@bitfox.top> wrote:
>> Hello list,
>>
>> I did a comparison of a pyspark RDD, a scala RDD, a pyspark dataframe and a
>> pure scala program. The results show that the pyspark RDD is far too slow.
>>
>> For the operations and dataset please see:
>> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>>
>> The result table is below.
>> Can you give suggestions on how to optimize the RDD operation?
>>
>> Thanks a lot.
>>
>>
program            time
-----------------  -----
scala program      49s
pyspark dataframe  56s
scala RDD          1m31s
pyspark RDD        7m15s
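
On the "pickled back and forth" point: with the RDD API every record crosses the JVM/Python boundary as pickled bytes, in both directions, which is a plausible contributor to the 7m15s row above. A rough, Spark-free illustration of that per-partition round trip (record counts here are arbitrary; this is a sketch, not a benchmark):

```python
import pickle

# Simulate one partition of (word, count) records. With a PySpark RDD,
# records are pickled when shipped to the Python worker and pickled
# again when the results go back to the JVM.
records = [("word%d" % (i % 1000), 1) for i in range(50_000)]

blob = pickle.dumps(records)    # JVM -> Python worker
restored = pickle.loads(blob)   # Python worker -> JVM

assert restored == records
```

DataFrame operations sidestep this tax because they compile to JVM code via Catalyst, and Arrow-backed pandas UDFs amortise it over columnar batches instead of paying it per record.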