[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-08-28 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/21087
  
Any further changes?


---




[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-09 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r209081079
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -580,6 +580,11 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
--- End diff --

Oh... did it change, or did I overlook 'sources'? Sure, I will change it!


---




[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-08-06 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/21087
  
retest this please


---




[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-06 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r207815199
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -580,6 +580,11 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
+.doc("The maximum number of buckets allowed. Defaults to 10")
+.longConf
--- End diff --

I was following the convention used in existing config entries, where integral values use longConf, without making further changes. However, I agree we could update the type in the class as well to match. I will submit the patch.
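
For reference, a minimal sketch of what the entry could look like with intConf instead of longConf. It is meant to sit inside `object SQLConf` next to the lines quoted above; the doc text and the positivity check here are illustrative assumptions, not the exact patch:

```scala
// Sketch only (assumed wording/validation): the same entry using intConf so the
// config type matches the Int field used for the number of buckets.
val BUCKETING_MAX_BUCKETS = buildConf("spark.sql.bucketing.maxBuckets")
  .doc("The maximum number of buckets allowed.")
  .intConf
  .checkValue(_ > 0, "the maximum number of buckets must be positive")
  .createWithDefault(100000)
```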


---




[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-08-03 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/21087
  
It seems the tests timed out. Any chance to re-run them?


---




[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-08-02 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/21087
  
retest this please


---




[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-08-01 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/21087
  
It would be great if an admin could review this. If there is anything to improve, please let me know. It is a very simple change, though.


---




[GitHub] spark issue #21087: [SPARK-23997][SQL] Configurable maximum number of bucket...

2018-04-23 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/21087
  
retest this please


---




[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-04-17 Thread ferdonline
GitHub user ferdonline opened a pull request:

https://github.com/apache/spark/pull/21087

[SPARK-23997][SQL] Configurable maximum number of buckets

## What changes were proposed in this pull request?
This PR makes the maximum number of buckets configurable, so that users can override the limit when saving to a table.
Currently the limit is hard-coded at 100k, which might be insufficient for large workloads.
A new configuration entry is proposed, `spark.sql.bucketing.maxBuckets`, which defaults to the previous 100k.
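
As a hedged usage sketch of what this enables (the table name, column, and bucket count below are illustrative, not part of the patch):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: raise the ceiling proposed by this PR, then write a table
// bucketed beyond the previously hard-coded 100k limit.
val spark = SparkSession.builder.appName("bucket-limit-example").getOrCreate()
spark.conf.set("spark.sql.bucketing.maxBuckets", 500000L) // key proposed in this PR

spark.range(1000000L).selectExpr("id", "id % 1000 AS key")
  .write
  .bucketBy(150000, "key")         // would have failed against the old 100k cap
  .sortBy("key")
  .saveAsTable("bucketed_example") // hypothetical table name
```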

## How was this patch tested?
Added unit tests in the following spark.sql test suites:

- CreateTableAsSelectSuite
- BucketedWriteSuite


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ferdonline/spark enh/configurable_bucket_limit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21087


commit 61a476fe1f90b2e4c8ddbf82024f8116d737d2ef
Author: Fernando Pereira <fernando.pereira@...>
Date:   2018-04-17T12:53:59Z

Adding configurable max buckets

commit a8846568db9eb63095c9dc55e8b71906ff95e6b0
Author: Fernando Pereira <fernando.pereira@...>
Date:   2018-04-17T15:22:18Z

fixing tests in spark.sql




---




[GitHub] spark issue #20269: [SPARK-23029] [DOCS] Specifying default units of configu...

2018-01-16 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/20269
  
retest this please


---




[GitHub] spark issue #20269: [SPARK-23029] [DOCS] Specifying default units of configu...

2018-01-16 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/20269
  
Hi. Thanks for your review. Sounds good; I will go through and add a unit blurb to them.
I wrote "Default unit: X" to keep it as short and obvious as possible, but I agree nicer English would be better in the HTML docs.


---




[GitHub] spark pull request #20269: [SPARK-23029] [DOCS] Specifying default units of ...

2018-01-15 Thread ferdonline
GitHub user ferdonline opened a pull request:

https://github.com/apache/spark/pull/20269

[SPARK-23029] [DOCS] Specifying default units of configuration entries

## What changes were proposed in this pull request?
This PR completes the docs by specifying the default units assumed by configuration entries of type size.
This is important because unit-less values are accepted, and users might assume the base unit is bytes, which in most cases it is not, leading to hard-to-debug problems.
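
To illustrate the ambiguity (the keys below are just examples of size-type entries; their individual default units are precisely what this PR documents):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: a bare number is parsed in the entry's own default unit,
// which differs between entries, so stating the unit explicitly avoids surprises.
val spark = SparkSession.builder
  .appName("explicit-units")
  .config("spark.executor.memory", "4g")             // explicit unit, unambiguous
  .config("spark.kryoserializer.buffer.max", "128m") // explicit unit, unambiguous
  // .config("spark.kryoserializer.buffer.max", "128") // bare value: depends on the default unit
  .getOrCreate()
```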

## How was this patch tested?
This patch updates documentation only.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ferdonline/spark docs_units

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20269.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20269


commit 889426b191f1f542012e0a5c9f0a121f64a2b46e
Author: Fernando Pereira <fernando.pereira@...>
Date:   2018-01-15T13:56:23Z

Specifying default units of configuration entries




---




[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

2017-12-17 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r157371948
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -540,9 +540,52 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean): Dataset[T] = checkpoint(eager = eager, reliableCheckpoint = true)
+
+  /**
+   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
+   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
+   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
+   * storage and despite potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = checkpoint(eager = true, reliableCheckpoint = false)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean): Dataset[T] = checkpoint(
+    eager = eager,
+    reliableCheckpoint = false
+  )
+
+  /**
+   * Returns a checkpointed version of this Dataset.
+   *
+   * @param eager Whether to checkpoint this dataframe immediately
+   * @param reliableCheckpoint Whether to create a reliable checkpoint saved to files inside the
+   *                           checkpoint directory. If false creates a local checkpoint using
+   *                           the caching subsystem
+   */
+  private def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T] = {
     val internalRdd = queryExecution.toRdd.map(_.copy())
-    internalRdd.checkpoint()
+    if (reliableCheckpoint) {
+      internalRdd.checkpoint()
+    } else {
+      internalRdd.localCheckpoint()
--- End diff --

Hi. Thanks for the review.
From the point of view of making users aware that they are doing a local checkpoint, we already force them to use localCheckpoint(), since the generic checkpoint is private.
If we should warn users about the potential issues with localCheckpoint(), shouldn't we do it in the RDD API, so that users are always warned?


---




[GitHub] spark issue #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint to Dat...

2017-12-12 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/19805
  
The existing checkpoint tests are applied to localCheckpoint as well, and all pass. Please verify.
I'm getting an unrelated failure in SparkR; did anything change in the build system?


---




[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

2017-12-02 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r154495320
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +537,48 @@ class Dataset[T] private[sql](
*/
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
--- End diff --

I can try to create a test for localCheckpoint based on the one for checkpoint, but I'm not very familiar with Scala and the Spark Scala API, so I don't currently feel confident writing a meaningful test. Would anybody be up for adding one?


---




[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

2017-11-29 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153802577
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +536,55 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  /**
+   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
+   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
+   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
+   * storage and despite potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
+
+  /**
+   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
+   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially.
+   * By default reliable checkpoints are created and saved to files inside the checkpoint
+   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
+   * is performed instead. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

Alright, so I will remove both default values; otherwise we end up with a colliding signature.


---




[GitHub] spark pull request #19805: [SPARK-22649][PYTHON][SQL] Adding localCheckpoint...

2017-11-29 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153746129
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +536,55 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  /**
+   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
+   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
+   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
+   * storage and despite potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
+
+  /**
+   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
+   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially.
+   * By default reliable checkpoints are created and saved to files inside the checkpoint
+   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
+   * is performed instead. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

Removing the default parameter values and keeping the method private sounds good to me. Let me know.


---




[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

2017-11-29 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153722443
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -537,9 +536,55 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  /**
+   * Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be
+   * used to truncate the logical plan of this Dataset, which is especially useful in iterative
+   * algorithms where the plan may grow exponentially. Local checkpoints are written to executor
+   * storage and despite potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(): Dataset[T] = _checkpoint(eager = true, local = true)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
+
+  /**
+   * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
+   * logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially.
+   * By default reliable checkpoints are created and saved to files inside the checkpoint
+   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
+   * is performed instead. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   * @since 2.3.0
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  private[sql] def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

I made the change according to the first review. I also agree it should be private so that "there's only one way of doing it" when using the API. But if you have a strong feeling about it, I can certainly change it.


---




[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

2017-11-28 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/19805
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

2017-11-27 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/19805
  
retest this please


---




[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

2017-11-27 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153147567
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -524,22 +524,41 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(): Dataset[T] = checkpoint(eager = true)
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
+
+  /**
+   * Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate
+   * the logical plan of this Dataset, which is especially useful in iterative algorithms where the
+   * plan may grow exponentially. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
+   *
+   * @group basic
+   */
+  @Experimental
+  @InterfaceStability.Evolving
+  def localCheckpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager, local = true)
 
   /**
    * Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the
    * logical plan of this Dataset, which is especially useful in iterative algorithms where the
-   * plan may grow exponentially. It will be saved to files inside the checkpoint
-   * directory set with `SparkContext#setCheckpointDir`.
+   * plan may grow exponentially.
+   * By default reliable checkpoints are created and saved to files inside the checkpoint
+   * directory set with `SparkContext#setCheckpointDir`. If local is set to true a local checkpoint
+   * is performed instead. Local checkpoints are written to executor storage and despite
+   * potentially faster they are unreliable and may compromise job completion.
    *
    * @group basic
    * @since 2.1.0
    */
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(eager: Boolean): Dataset[T] = {
+  def _checkpoint(eager: Boolean, local: Boolean = false): Dataset[T] = {
--- End diff --

Sounds good. Regarding the naming, I'm not sure adding an underscore is a 
good choice. Let me know if it should be changed.


---




[GitHub] spark pull request #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset A...

2017-11-27 Thread ferdonline
Github user ferdonline commented on a diff in the pull request:

https://github.com/apache/spark/pull/19805#discussion_r153145177
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -524,22 +524,41 @@ class Dataset[T] private[sql](
*/
   @Experimental
   @InterfaceStability.Evolving
-  def checkpoint(): Dataset[T] = checkpoint(eager = true)
+  def checkpoint(eager: Boolean = true): Dataset[T] = _checkpoint(eager = eager)
--- End diff --

I always try to avoid code duplication, and with the docs this takes ~10 lines for nothing. I believe a function with a default parameter is as readable as the function without it. But please let me know if it goes against the code style. Thanks.
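
To make the trade-off concrete, here is a small sketch with simplified signatures (not the actual Dataset code) of the two styles being discussed; one consideration is that Scala default arguments are not usable from Java, which is why explicit overloads are often kept on public APIs:

```scala
object OverloadStyles {
  // Style 1: a single method with a default argument (shorter, no duplicated docs).
  def checkpointA(eager: Boolean = true): String = s"checkpointA(eager=$eager)"

  // Style 2: an explicit no-arg overload delegating to the full signature
  // (more verbose, but callable with no arguments from Java as well).
  def checkpointB(): String = checkpointB(eager = true)
  def checkpointB(eager: Boolean): String = s"checkpointB(eager=$eager)"
}
```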


---




[GitHub] spark issue #19805: [PYTHON][SQL] Adding localCheckpoint to Dataset API

2017-11-27 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/19805
  
Thanks for your review. I'm working on the changes.


---




[GitHub] spark issue #19805: [SQL] Adding localCheckpoint to Dataset API

2017-11-24 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/19805
  
cc @andrewor14: since I believe you know this part of Spark pretty well, maybe you could help me integrate this.
Any idea why Jenkins didn't start the tests?


---




[GitHub] spark issue #19805: Adding localCheckpoint to Dataframe API

2017-11-24 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/19805
  
retest this please


---




[GitHub] spark pull request #19805: Adding localCheckpoint to Dataframe API

2017-11-23 Thread ferdonline
GitHub user ferdonline opened a pull request:

https://github.com/apache/spark/pull/19805

Adding localCheckpoint to Dataframe API

## What changes were proposed in this pull request?

This change adds local checkpoint support to Datasets, with the corresponding binding in the Python DataFrame API.

If reliability requirements can be relaxed in favor of performance, as when a few quick transformations are followed by a reliable save, localCheckpoint() fits very well.
Furthermore, reliable checkpoints currently still incur double computation (see #9428).
In general it also makes the API more complete.
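
A minimal Scala sketch of that pattern (dataset contents and output path are placeholders), complementing the Python timing example below:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative pattern: truncate the plan cheaply with a local checkpoint,
// run a few quick follow-up transformations, then do one reliable save.
val spark = SparkSession.builder.appName("local-checkpoint-pattern").getOrCreate()

val heavy    = spark.range(1000000L).selectExpr("id", "id * 2 AS doubled")
val snapshot = heavy.localCheckpoint()                     // plan truncated; blocks stored on executors
val result   = snapshot.filter("doubled % 3 = 0")          // cheap transformations on top
result.write.mode("overwrite").parquet("/tmp/lc_example")  // the reliable save at the end
```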

## How was this patch tested?

A quick use case on the Python side:

```python
In [1]: from time import sleep

In [2]: from pyspark.sql import types as T

In [3]: from pyspark.sql import functions as F

In [4]: def f(x):
sleep(1)
return x*2
   ...: 

In [5]: df1 = spark.range(30, numPartitions=6)

In [6]: df2 = df1.select(F.udf(f, T.LongType())("id"))

In [7]: %time _ = df2.collect()
CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms   

Wall time: 12.2 s

In [8]: %time df3 = df2.localCheckpoint()
CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms

Wall time: 10.3 s

In [9]: %time _ = df3.collect()
CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
Wall time: 148 ms
```

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ferdonline/spark feature_dataset_localCheckpoint

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19805.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19805


commit abe03ab0e8d6647ccb8949a39c431cd845c23dbb
Author: Fernando Pereira <fernando.pere...@epfl.ch>
Date:   2017-11-23T18:49:37Z

Adding localCheckpoint to Dataframe API




---




[GitHub] spark issue #9428: [SPARK-8582][Core]Optimize checkpointing to avoid computi...

2017-11-17 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/9428
  
That's the reason I want to checkpoint the results when they are first calculated: further transformations use them several times. Of course it's not a problem per se to compute twice for the checkpoint, but doing so for 1+ TB of data is unreasonable, and I can't cache.


---




[GitHub] spark issue #9428: [SPARK-8582][Core]Optimize checkpointing to avoid computi...

2017-11-16 Thread ferdonline
Github user ferdonline commented on the issue:

https://github.com/apache/spark/pull/9428
  
Hello. I find this feature really important and I would be happy to contribute here. Even if we could not support every use case, it would already be great to avoid the double computation in the majority of cases, and to raise a warning in the remaining cases saying that the computation is going to happen twice.

This is especially important for a use case I have where a transformation creates random numbers, so I simply can't recompute things, as the results would be different. So in my case the only option to break lineage seems to be a full write() followed by read().
Any plans to have this in eager checkpoints at least?
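
For concreteness, a sketch of the write()-then-read() workaround mentioned above (path and data are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: materialize once to storage and read back, so downstream
// transformations no longer recompute the non-deterministic column.
val spark = SparkSession.builder.appName("lineage-break").getOrCreate()

val df   = spark.range(1000L).selectExpr("id", "rand() AS noise") // random column: must not recompute
val path = "/tmp/lineage_break"                                   // hypothetical location
df.write.mode("overwrite").parquet(path)                          // computed exactly once
val stable = spark.read.parquet(path)                             // lineage now starts from the files
```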


---
