[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/21087 Any further changes?
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r209081079

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -580,6 +580,11 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
--- End diff --

Oh... did it change, or did I overlook 'sources'? Sure, I will change it!
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/21087 retest this please
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r207815199

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -580,6 +580,11 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)

+  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
+    .doc("The maximum number of buckets allowed. Defaults to 10")
+    .longConf
--- End diff --

I was following the convention used in the config entries, where integral values use longConf, without making further changes. However, I agree we could update the class type as well to match. I will submit the patch.
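A minimal sketch of what the revised entry could look like with `intConf` instead (the validation and message below are assumptions, not the merged patch; `buildConf` is the helper inside `object SQLConf` quoted above):

```scala
// Inside object SQLConf, alongside the other entries:
val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
  .doc("The maximum number of buckets allowed.")
  .intConf  // Int matches the numBuckets field it guards
  .checkValue(_ > 0, "the value of spark.sql.bucketing.maxBuckets must be positive")
  .createWithDefault(100000)
```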
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/21087 It seems the tests timed out. Any chance to re-run them?
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/21087 retest this please
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/21087 It would be great if an admin could review. If there is anything to improve, please let me know. It is very simple, though.
[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/21087 retest this please
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
GitHub user ferdonline opened a pull request:

https://github.com/apache/spark/pull/21087

[SPARK-23997][SQL] Configurable maximum number of buckets

## What changes were proposed in this pull request?

This PR gives the user the possibility to override the maximum number of buckets when saving to a table. Currently the limit is hard-coded at 100k, which may be insufficient for large workloads. A new configuration entry is proposed: `spark.sql.bucketing.maxBuckets`, which defaults to the previous 100k.

## How was this patch tested?

Added unit tests in the following spark.sql test suites:
- CreateTableAsSelectSuite
- BucketedWriteSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ferdonline/spark enh/configurable_bucket_limit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21087.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #21087

commit 61a476fe1f90b2e4c8ddbf82024f8116d737d2ef
Author: Fernando Pereira <fernando.pereira@...>
Date: 2018-04-17T12:53:59Z

Adding configurable max buckets

commit a8846568db9eb63095c9dc55e8b71906ff95e6b0
Author: Fernando Pereira <fernando.pereira@...>
Date: 2018-04-17T15:22:18Z

fixing tests in spark.sql
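For illustration, this is how a user might raise the limit and then bucket past the old ceiling, assuming the behavior described above (the data, column, and table names here are made up for the example):

```scala
// Raise the proposed ceiling above the previous hard-coded 100k:
spark.conf.set("spark.sql.bucketing.maxBuckets", "150000")

// Writing with more than 100k buckets should now be accepted:
val df = spark.range(10000000L).toDF("user_id")  // illustrative data
df.write
  .bucketBy(120000, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")                // hypothetical table name
```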
[GitHub] spark issue #20269: [SPARK-23029] [DOCS] Specifying default units of configu...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/20269 retest this please
[GitHub] spark issue #20269: [SPARK-23029] [DOCS] Specifying default units of configu...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/20269 Hi. Thanks for your review. Sounds good, I will go around and add a "unit blurb" to them. I wrote "Default unit: X" to keep it as short and obvious as possible, but I agree nicer English would be better in the HTML docs.
[GitHub] spark pull request #20269: [SPARK-23029] [DOCS] Specifying default units of ...
GitHub user ferdonline opened a pull request:

https://github.com/apache/spark/pull/20269

[SPARK-23029] [DOCS] Specifying default units of configuration entries

## What changes were proposed in this pull request?

This PR completes the docs by specifying the default units assumed in configuration entries of the size type. This is crucial since unit-less values are accepted and the user might assume the base unit is bytes, which in most cases it is not, leading to hard-to-debug problems.

## How was this patch tested?

This patch updates documentation only.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ferdonline/spark docs_units

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20269.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #20269

commit 889426b191f1f542012e0a5c9f0a121f64a2b46e
Author: Fernando Pereira <fernando.pereira@...>
Date: 2018-01-15T13:56:23Z

Specifying default units of configuration entries
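To illustrate the ambiguity this PR documents: size-typed entries are declared against a base unit via Spark's internal `ConfigBuilder` (only usable inside Spark's own packages; the key and default below are made up for the example):

```scala
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// A unit-less value of "64" for this entry means 64 MiB, not 64 bytes:
// exactly the kind of default unit the docs should spell out.
val EXAMPLE_BUFFER_SIZE = ConfigBuilder("spark.example.buffer.size") // hypothetical key
  .doc("Buffer size. Default unit: MiB.")
  .bytesConf(ByteUnit.MiB)
  .createWithDefaultString("64m")
```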
[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r157371948

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -540,9 +540,52 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, reliableCheckpoint = true)
+
+  /**
+   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
+   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
+   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
+   * storage and despite potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = false)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint(
+    eager = eager,
+    reliableCheckpoint = false
+  )
+
+  /**
+   * Returns a checkpointed version of this Dataset.
+   *
+   * @param eager Whether to checkpoint this dataframe immediately
+   * @param reliableCheckpoint Whether to create a reliable checkpoint saved to files inside the
+   *                           checkpoint directory. If false creates a local checkpoint using
+   *                           the caching subsystem
+   */
+  private def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T] = {
     val internalRdd = queryExecution.toRdd.map(_.copy())
-    internalRdd.checkpoint()
+    if (reliableCheckpoint) {
+      internalRdd.checkpoint()
+    } else {
+      internalRdd.localCheckpoint()
--- End diff --

Hi. Thanks for the review. As for making the user aware that they are doing a local checkpoint, we already force them to use localCheckpoint() (the generic checkpoint is private). If we should warn users about the potential issues with localCheckpoint(), shouldn't we do it in the RDD API, so that users are always warned?
[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/19805 I applied the existing checkpoint tests to localCheckpoint as well, and they all pass. Please verify. I'm getting an unrelated failure in SparkR; did anything change in the build system?
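A quick spark-shell style check of the behavior those tests verify (a sketch, not the actual suite code; plan-printing output varies by version):

```scala
val ds = spark.range(10)
val cp = ds.localCheckpoint()   // eager by default

// Same data back...
assert(cp.collect().sameElements(ds.collect()))

// ...but the lineage is truncated: the plan is now a scan over the
// checkpointed RDD rather than the original Range.
println(cp.queryExecution.optimizedPlan)
```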
[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r154495320

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +537,48 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
--- End diff --

I can try to create a test for localCheckpoint based on the one for checkpoint, but I'm not very familiar with Scala and the Spark Scala API, so currently I don't feel at ease creating a meaningful test. Would anybody be up for adding one?
[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153802577

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +536,55 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  /**
+   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
+   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
+   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
+   * storage and despite potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
+
+  /**
+   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
+   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially.
+   * By default reliable checkpoints are created and saved to files inside the checkpoint
+   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
+   * is performed instead. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

Alright, so I will remove both default values; otherwise we end up with a colliding signature.
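For context, the overload set without default values looks roughly like the later revision quoted elsewhere in this thread (a sketch; doc comments and the private implementation are omitted). Keeping the original `checkpoint()` next to `checkpoint(eager: Boolean = true)` makes a bare `ds.checkpoint()` call ambiguous, so each variant delegates explicitly:

```scala
def checkpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = true)
def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager, reliableCheckpoint = true)
def localCheckpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = false)
def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint(eager, reliableCheckpoint = false)
```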
[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153746129

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +536,55 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  [...]
+
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
+
+  [...]
+
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
+
+  [...]
+
+  @Experimental
+  @InterfaceStability.Evolving
+  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

Sounds good to me to remove the default param values while keeping it private. Let me know.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153722443

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +536,55 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  [...]
+
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
+
+  [...]
+
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
+
+  [...]
+
+  @Experimental
+  @InterfaceStability.Evolving
+  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

I changed it according to the first review. I also agree it should be private, so that "there's only one way of doing it" when using the API. But if you have a strong feeling, I can surely change it.
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/19805 retest this please
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/19805 retest this please
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153147567

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -524,22 +524,41 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(): Dataset[T] = checkpoint(eager = true)
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)

   /**
    * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    * logical plan of this Dataset, which is especially useful in iterative algorithms where the
-   * plan may grow exponentially. It will be saved to files inside the checkpoint
-   * directory set with `SparkContext#setCheckpointDir`.
+   * plan may grow exponentially.
+   * By default reliable checkpoints are created and saved to files inside the checkpoint
+   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
+   * is performed instead. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
    *
    * @group basic
    * @since 2.1.0
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

Sounds good. Regarding the naming, I'm not sure adding an underscore is a good choice. Let me know if it should be changed.
[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153145177

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -524,22 +524,41 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(): Dataset[T] = checkpoint(eager = true)
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
--- End diff --

I always try to avoid code duplication, and with docs this takes ~10 lines for nothing; I believe a function with a default parameter is as readable as the function without the parameter. But please let me know if it goes against the code style. Thanks.
[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/19805 Thanks for your review. I'm working on the changes.
[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/19805 cc @andrewor14, since I believe you know this part of Spark pretty well; maybe you could help me integrate this. Any idea why Jenkins didn't start the tests?
[GitHub] spark issue #19805: Adding localCheckpoint to Dataframe API
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/19805 retest this please
[GitHub] spark pull request #19805: Adding localCheckpoint to Dataframe API
GitHub user ferdonline opened a pull request:

https://github.com/apache/spark/pull/19805

Adding localCheckpoint to Dataframe API

## What changes were proposed in this pull request?

This change adds local checkpoint support to Datasets and the respective binding in the Python DataFrame API. If reliability requirements can be lowered in favor of performance, as in cases of quick further transformations followed by a reliable save, localCheckpoint() fits very well. Furthermore, at the moment reliable checkpoints still incur double computation (see #9428). In general it makes the API more complete as well.

## How was this patch tested?

Quick use case from the Python side:

```python
In [1]: from time import sleep
In [2]: from pyspark.sql import types as T
In [3]: from pyspark.sql import functions as F
In [4]: def f(x):
   ...:     sleep(1)
   ...:     return x*2
   ...:
In [5]: df1 = spark.range(30, numPartitions=6)
In [6]: df2 = df1.select(F.udf(f, T.LongType())("id"))
In [7]: %time _ = df2.collect()
CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms
Wall time: 12.2 s
In [8]: %time df3 = df2.localCheckpoint()
CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms
Wall time: 10.3 s
In [9]: %time _ = df3.collect()
CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
Wall time: 148 ms
```

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ferdonline/spark feature_dataset_localCheckpoint

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19805.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #19805

commit abe03ab0e8d6647ccb8949a39c431cd845c23dbb
Author: Fernando Pereira <fernando.pere...@epfl.ch>
Date: 2017-11-23T18:49:37Z

Adding localCheckpoint to Dataframe API
[GitHub] spark issue #9428: [SPARK-8582][Core]Optimize checkpointing to avoid computi...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/9428 That's the reason why I want to checkpoint the results when they are first calculated. Further transformations use these results several times. Of course it's not a problem per se to compute twice for the checkpoint, but doing so for 1+ TB of data is unreasonable, and I can't cache that much.
[GitHub] spark issue #9428: [SPARK-8582][Core]Optimize checkpointing to avoid computi...
Github user ferdonline commented on the issue: https://github.com/apache/spark/pull/9428 Hello. I find this feature to be really important and I would be happy to contribute here. Even though we would potentially not support every use case, it would already be great if we could avoid the double computation in the majority of cases, while raising a warning in the other cases saying that computation is going to happen twice. This is especially important for a use case I have where a transformation creates random numbers, so I simply can't recompute things, as the results would differ. So in my case the only option to break lineage seems to be a full write() followed by read(). Any plans to have it in eager checkpoints at least?
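A sketch of that write()-followed-by-read() workaround for non-deterministic data (the path, format, and data are illustrative):

```scala
import org.apache.spark.sql.functions.rand

// Materialize once to reliable storage, then read back: the lineage (and the
// random-number-generating transformation) is never re-executed.
val df = spark.range(100).withColumn("noise", rand())  // non-deterministic column
df.write.mode("overwrite").parquet("/tmp/stage1")      // computed exactly once
val materialized = spark.read.parquet("/tmp/stage1")   // lineage starts fresh here
```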