I understand that checkpoint cuts the lineage, but I am not sure I
understand the role of eager.

Eager simply seems to materialize the RDD early with a count, right after
rdd.checkpoint has been called. But why is that useful? rdd.checkpoint is
asynchronous, so when rdd.count runs, rdd.isCheckpointed will most likely
be false, and the count will run on the RDD before it was checkpointed.
What is the benefit of that?
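
To make the question concrete, here is a toy model in plain Python (no
Spark involved). ToyRDD and all of its methods are made-up names, not Spark's
API. It assumes that checkpoint only marks the RDD and that the data is
written as a side effect of the first action. That is my reading, not
something confirmed in this thread, and the model ignores how the write is
actually scheduled.

```python
# Toy model only: ToyRDD and its methods are hypothetical, not Spark's API.
class ToyRDD:
    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent          # lineage link to the upstream ToyRDD
        self.fn = fn                  # per-element transformation
        self.data = data              # set for source or checkpointed nodes
        self.marked = False           # checkpoint requested, not yet written
        self.is_checkpointed = False

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def lineage_depth(self):
        # Number of parent links between this node and materialized data.
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth

    def _compute(self):
        # Walk back to the nearest materialized node, then replay the maps.
        chain, node = [], self
        while node.data is None:
            chain.append(node.fn)
            node = node.parent
        rows = list(node.data)
        for fn in reversed(chain):
            rows = [fn(x) for x in rows]
        return rows

    def checkpoint(self, eager=False):
        self.marked = True            # lazy: only records the request
        if eager:
            self.count()              # eager: force an action right away
        return self

    def count(self):
        rows = self._compute()
        if self.marked and not self.is_checkpointed:
            # Assumption: the write happens as part of the first action.
            self.data, self.parent, self.fn = rows, None, None
            self.is_checkpointed = True
        return len(rows)


def build(eager):
    rdd = ToyRDD(data=[1, 2, 3])
    for _ in range(50):
        rdd = rdd.map(lambda x: x + 1)
    return rdd.checkpoint(eager=eager)


lazy = build(eager=False)
eager = build(eager=True)
```

In this model the lazy variant still carries all 50 parent links until some
later action runs, while the eager variant has already dropped them by the
time checkpoint returns. If that matches what Spark does, the count would be
what triggers the checkpoint rather than something racing against it.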


On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Hi,
>
> One of the goals of checkpointing is to cut the RDD lineage. Otherwise you
> can run into a StackOverflowError. If you checkpoint eagerly, you cut the
> lineage at that point, and all subsequent operations depend on the
> checkpointed DataFrame. If you don't checkpoint, the lineage keeps
> growing, and you may hit the StackOverflowError while that lineage is
> being resolved.
>
> HTH,
> Burak
>
> On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin <j...@jgp.net> wrote:
>
>> Hey Sparkers,
>>
>> Trying to understand the Dataframe's checkpoint (*not* in the context of
>> streaming):
>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
>>
>> What is the goal of the *eager* flag?
>>
>> Thanks!
>>
>> jg
>>
>
>
