Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

Davies Liu Thu, 02 Jul 2015 13:22:37 -0700

On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl <a...@whisperstream.com> wrote:
> In pyspark, when I convert from rdds to dataframes it looks like the rdd is
> being materialized/collected/repartitioned before it's converted to a
> dataframe.


That's not the case. When converting an RDD to a DataFrame, Spark only takes a
few rows to infer the types; no other collect or repartition happens.
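
To make the idea concrete, here is a minimal sketch in plain Python. It is not
Spark's actual code, and lazy_rows and infer_types are hypothetical names made
up for this illustration. It only shows that inferring types from a small
sample leaves the rest of a lazy dataset untouched:

```python
from itertools import islice


def lazy_rows(n, counter):
    """Yield n rows lazily, counting how many are actually produced."""
    for i in range(n):
        counter["produced"] += 1
        yield (i, "name-%d" % i, i * 0.5)


def infer_types(rows, sample_size=3):
    """Infer one type name per column from the first sample_size rows only."""
    sample = list(islice(rows, sample_size))
    if not sample:
        raise ValueError("cannot infer types from an empty dataset")
    types = [type(value).__name__ for value in sample[0]]
    # Check that the remaining sampled rows agree with the first one.
    for row in sample[1:]:
        for idx, value in enumerate(row):
            if type(value).__name__ != types[idx]:
                raise TypeError("column %d has mixed types in the sample" % idx)
    return types


counter = {"produced": 0}
rows = lazy_rows(1000000, counter)  # nothing is computed yet
column_types = infer_types(rows)    # pulls only the sampled rows

print(column_types)         # ['int', 'str', 'float']
print(counter["produced"])  # 3
```

In PySpark itself, passing an explicit schema to createDataFrame skips type
inference, if you would rather not rely on sampling at all.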

> Just wondering if there's any guidelines for doing this conversion and
> whether it's best to do it early to get the performance benefits of
> dataframes or weigh that against the size/number of items in the rdd.

It's better to do it as early as possible, I think.

> Thanks,
>
> -Axel
>

