On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl <a...@whisperstream.com> wrote:
> In pyspark, when I convert from rdds to dataframes it looks like the rdd is
> being materialized/collected/repartitioned before it's converted to a
> dataframe.
That's not the case. When converting an RDD to a DataFrame, Spark only takes a few rows to infer the column types; no other collect or repartition happens.

> Just wondering if there's any guidelines for doing this conversion and
> whether it's best to do it early to get the performance benefits of
> dataframes or weigh that against the size/number of items in the rdd.

It's better to do it as early as possible, I think.

> Thanks,
>
> -Axel
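
To make the point about inference concrete, here is a rough plain-Python analogy (it does not use Spark at all). The names `lazy_rows`, `infer_schema` and `sample_size` are made up for illustration and are not part of the PySpark API; the sketch only shows how types can be inferred from the first row of a lazy source without producing the rest.

```python
from itertools import islice


def lazy_rows(n, counter):
    """Yield n rows lazily, counting how many were actually produced."""
    for i in range(n):
        counter["produced"] += 1
        yield {"id": i, "name": "row%d" % i}


def infer_schema(rows, sample_size=1):
    """Map column name -> type name, looking only at the first few rows."""
    schema = {}
    for row in islice(rows, sample_size):
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema


counter = {"produced": 0}
schema = infer_schema(lazy_rows(1000000, counter))

print(schema)               # {'id': 'int', 'name': 'str'}
print(counter["produced"])  # 1: the other rows were never generated
```

This is only an analogy: inference reads a small prefix of the data and leaves the rest untouched until something actually needs it.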