GitHub user ferdonline opened a pull request:

    https://github.com/apache/spark/pull/19805

    Adding localCheckpoint to Dataframe API

    ## What changes were proposed in this pull request?
    
    This change adds local checkpoint support to Datasets, along with the corresponding binding in the Python DataFrame API.
    
    If reliability requirements can be relaxed in favor of performance, as when a few quick further transformations are followed by a reliable save, localCheckpoint() fits very well.
    Furthermore, reliable checkpoints currently still incur double computation (see #9428).
    In general, it also makes the API more complete.
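
As a rough illustration of the trade-off described above, here is a toy pure-Python model of lazy evaluation with a local checkpoint. It is not Spark code: `LazyDataset`, `local_checkpoint`, `expensive` and `call_log` are hypothetical names invented for this sketch, and the model ignores partitions, executors and storage levels. It only shows that, without a checkpoint, each action replays the lineage, while a local checkpoint computes once and cuts the lineage at the cost of keeping the rows in non-reliable local storage.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source: List[int],
                 fn: Optional[Callable[[int], int]] = None,
                 parent: Optional["LazyDataset"] = None):
        self._source = source    # materialized rows, used only when there is no parent
        self._fn = fn            # transformation applied on top of the parent
        self._parent = parent    # lineage link; None means "read from _source"

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Record the transformation; nothing is computed yet.
        return LazyDataset([], fn, self)

    def collect(self) -> List[int]:
        # Every action replays the whole lineage from the root.
        if self._parent is None:
            return list(self._source)
        return [self._fn(x) for x in self._parent.collect()]

    def local_checkpoint(self) -> "LazyDataset":
        # Compute eagerly once, then return a dataset with no lineage.
        # The rows live only in this process, so they are lost if it dies.
        return LazyDataset(self.collect())


call_log: List[int] = []  # records every invocation of the expensive function


def expensive(x: int) -> int:
    call_log.append(x)
    return x * 2


df2 = LazyDataset(list(range(30))).map(expensive)
df2.collect()                       # runs `expensive` on all 30 rows
df3 = df2.local_checkpoint()        # runs it on all 30 rows again, eagerly
calls_before = len(call_log)
result = df3.collect()              # lineage was cut: no further calls
calls_after = len(call_log)
```

In this model the second `collect()` is cheap for the same reason the checkpointed DataFrame is cheap to collect: the expensive function is no longer part of the plan that the action has to run.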
    
    ## How was this patch tested?
    
    Python land quick use case:
    
    ```python
    In [1]: from time import sleep
    
    In [2]: from pyspark.sql import types as T
    
    In [3]: from pyspark.sql import functions as F
    
    In [4]: def f(x):
       ...:     sleep(1)
       ...:     return x*2
       ...: 
    
    In [5]: df1 = spark.range(30, numPartitions=6)
    
    In [6]: df2 = df1.select(F.udf(f, T.LongType())("id"))
    
    In [7]: %time _ = df2.collect()
    CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms
    Wall time: 12.2 s
    
    In [8]: %time df3 = df2.localCheckpoint()
    CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms
    Wall time: 10.3 s
    
    In [9]: %time _ = df3.collect()
    CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
    Wall time: 148 ms
    ```
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ferdonline/spark feature_dataset_localCheckpoint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19805
    
----
commit abe03ab0e8d6647ccb8949a39c431cd845c23dbb
Author: Fernando Pereira <fernando.pere...@epfl.ch>
Date:   2017-11-23T18:49:37Z

    Adding localCheckpoint to Dataframe API

----

