Yes, in the sense that any transformation that can be expressed in the
SQL-like DataFrame API will push down to the JVM, take advantage of the
optimizer, and avoid moving data back and forth between Python and the
JVM, among other benefits. But you can't do this if you're expressing
operations that aren't in the DataFrame API, i.e. custom logic. The two
are not always alternatives.
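
For illustration, a minimal sketch of the difference (the DataFrame and
column names are invented for the example):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("value", F.rand())

    # DataFrame API: the filter and aggregation are planned and executed
    # entirely in the JVM; no rows cross into Python worker processes.
    df.filter(F.col("value") > 0.5).agg(F.avg("value")).show()

    # RDD API: every row is serialized out to a Python worker, the lambda
    # runs in CPython, and the results are serialized back.
    df.rdd.filter(lambda row: row["value"] > 0.5).count()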

There, pandas UDFs are a better choice in Python, as you can take
advantage of Arrow for data movement; that is another reason to use
DataFrames in a case like this. The UDF still has to execute code in
Python, though.
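
As a sketch, continuing the example above (the scaling logic is just a
stand-in for real custom code; requires pyarrow to be installed):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # pandas UDF: data moves between the JVM and Python in Arrow batches,
    # and the function runs vectorized over pandas Series rather than row
    # by row. It still executes in a Python worker, though.
    @pandas_udf("double")
    def scaled(v: pd.Series) -> pd.Series:
        return v * 2.0

    df.select(scaled("value")).show()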

On Fri, Feb 4, 2022 at 3:20 AM Bitfox <bit...@bitfox.top> wrote:

> Please see this test of mine:
>
> https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
>
> Don’t use Python RDDs; use DataFrames instead.
>
> Regards
>
> On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar <hinko.koce...@ess.eu.invalid>
> wrote:
>
>> I'm looking into using the Python interface with Spark and came across
>> this [1] chart showing some performance hit when going with Python RDDs.
>> The data is ~7 years old and for an older version of Spark. Is this still
>> the case with more recent Spark releases?
>>
>> I'm trying to understand what to expect from Python and Spark and under
>> what conditions.
>>
>> [1]
>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>
>> Thanks,
>> //hinko
