GitHub user ferdonline opened a pull request: https://github.com/apache/spark/pull/19805
Adding localCheckpoint to Dataframe API

## What changes were proposed in this pull request?

This change adds local checkpoint support to Datasets, along with the corresponding binding in the Python DataFrame API.

If reliability requirements can be lowered in favor of performance, as when a few quick further transformations are followed by a reliable save, localCheckpoint() fits very well. Furthermore, reliable checkpoints currently still incur double computation (see #9428).

In general, it also makes the API more complete.

## How was this patch tested?

Quick use case from Python:

```python
In [1]: from time import sleep
In [2]: from pyspark.sql import types as T
In [3]: from pyspark.sql import functions as F

In [4]: def f(x):
   ...:     sleep(1)
   ...:     return x*2
   ...:

In [5]: df1 = spark.range(30, numPartitions=6)
In [6]: df2 = df1.select(F.udf(f, T.LongType())("id"))

In [7]: %time _ = df2.collect()
CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms
Wall time: 12.2 s

In [8]: %time df3 = df2.localCheckpoint()
CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms
Wall time: 10.3 s

In [9]: %time _ = df3.collect()
CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
Wall time: 148 ms
```

Please review http://spark.apache.org/contributing.html before opening a pull request.
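The timings in the session (about 10 s for `localCheckpoint()`, then 148 ms for the following `collect()`) are consistent with the checkpoint computing the data once and cutting the lineage, so that later actions no longer re-run the slow UDF. The sketch below is a toy model of that idea in plain Python. It is not Spark's implementation; `LazyDataset`, `local_checkpoint`, `double` and `call_count` are hypothetical names chosen only for illustration.

```python
# Toy model only: plain Python, NOT Spark code. All names here are hypothetical.


class LazyDataset:
    """A dataset described by source rows plus a chain of pending transformations."""

    def __init__(self, source, transforms=()):
        self._source = list(source)
        self._transforms = tuple(transforms)

    def map(self, fn):
        # Lazy: record the transformation without running it.
        return LazyDataset(self._source, self._transforms + (fn,))

    def collect(self):
        # Every collect() replays the whole recorded lineage from the source.
        rows = self._source
        for fn in self._transforms:
            rows = [fn(x) for x in rows]
        return list(rows)

    def local_checkpoint(self):
        # Eager: compute once, then return a dataset with an empty lineage.
        return LazyDataset(self.collect())


call_count = {"n": 0}


def double(x):
    """Stand-in for the slow UDF f in the session above; counts its invocations."""
    call_count["n"] += 1
    return x * 2


df2 = LazyDataset(range(30)).map(double)
df2.collect()                      # replays the lineage: double() runs for every row
df3 = df2.local_checkpoint()       # computes everything again, eagerly
calls_before = call_count["n"]
result = df3.collect()             # lineage is gone: double() is not called again
calls_after = call_count["n"]
```

In this model the cost is paid inside `local_checkpoint()`, and every later `collect()` on the returned dataset only copies the stored rows. The trade-off described in the pull request (lower reliability in exchange for speed) is not modeled here.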
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ferdonline/spark feature_dataset_localCheckpoint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19805

----
commit abe03ab0e8d6647ccb8949a39c431cd845c23dbb
Author: Fernando Pereira <fernando.pere...@epfl.ch>
Date:   2017-11-23T18:49:37Z

    Adding localCheckpoint to Dataframe API

----