Hey Everyone! I've been converting between Parquet <-> Spark DataFrames <-> R data frames for larger data sets. The conversion is quite slow on the Spark <-> R side, and I'm looking for some insight on how to speed it up (or to find out what I have failed to do properly)!
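For context, a stripped-down version of what I'm running looks roughly like this (Spark 2.0 API; the paths and session settings are just placeholders):

    library(SparkR)

    # Local session; memory/core settings here are placeholders
    sparkR.session(master = "local[4]",
                   sparkConfig = list(spark.driver.memory = "8g"))

    # Parquet -> Spark DataFrame: fast
    df <- read.df("data.parquet", source = "parquet")

    # Spark DataFrame -> R data frame: the slow step I'm timing
    system.time(rdf <- collect(df))

    # R data frame -> Spark DataFrame -> Parquet: also slow
    sdf <- createDataFrame(rdf)
    system.time(write.df(sdf, path = "out.parquet",
                         source = "parquet", mode = "overwrite"))

    sparkR.session.stop()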
In R, "sparkR::collect" and "sparkR::write.df" take much longer than Spark reading and writing Parquet. While these aren’t the same operations, the difference suggests that there is a bottleneck within the translation between R data frames and Spark Data Frames. A profile of the SparkR code shows that R is spending a large portion of its time within "sparkR:::readTypedObject", "sparkR:::readBin", and "sparkR:::readObject". To me, this suggests that the serialization step accounts for the slow speed, but I don't want to guess too much. Any thoughts on how to speed the conversion? Details: Tried with Spark 2.0 and 1.6.1 (and the associated SparkR package) and R 3.3.0. On a Macbook Pro, 16BG Ram, Quad Core. +--------+-----------+-----------------+------------------+ | # Rows | # Columns | sparkR::collect | sparkR::write.df | +--------+-----------+-----------------+------------------+ | 600K | 20 | 3min | 6min | +--------+-----------+-----------------+------------------+ | 1.8M | 20 | 9min | 20min | +--------+-----------+-----------------+------------------+ | 600K | 1 | 40 sec | 4min | +--------+-----------+-----------------+------------------+ Thanks! Jonathan --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org