Re: Temp checkpoint directory for EMR (S3 or HDFS)

Asher Krim Tue, 30 May 2017 14:05:52 -0700

You should actually be able to get to the underlying filesystem from your
SparkContext:


String originalFs = sparkContext.hadoopConfiguration().get("fs.defaultFS");


and then you could just use that:

String checkpointPath = String.format("%s/%s/", originalFs,
checkpointDirectory);
sparkContext.setCheckpointDir(checkpointPath);


Asher Krim
Senior Software Engineer

On Tue, May 30, 2017 at 12:37 PM, Everett Anderson <ever...@nuna.com.invalid
> wrote:

> Still haven't found a --conf option.
>
> Regarding a temporary HDFS checkpoint directory, it looks like when using
> --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment
> variable. Thus, one could do the following when creating a SparkSession:
>
> val checkpointPath = new Path(System.getenv("SPARK_YARN_STAGING_DIR"),
> "checkpoints").toString
> sparkSession.sparkContext.setCheckpointDir(checkpointPath)
>
> The staging directory is in an HDFS path like
>
> /user/[user]/.sparkStaging/[YARN application ID]
>
> and is deleted at the end of the application
> <https://github.com/apache/spark/blob/branch-2.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L184>
> .
>
> So this is one option, though certainly abusing the staging directory.
>
> A more general one might be to find where Dataset.persist(DISK_ONLY)
> writes.
>
>
> On Fri, May 26, 2017 at 9:08 AM, Everett Anderson <ever...@nuna.com>
> wrote:
>
>> Hi,
>>
>> I need to set a checkpoint directory as I'm starting to use GraphFrames.
>> (Also, occasionally my regular DataFrame lineages get too long so it'd be
>> nice to use checkpointing to squash the lineage.)
>>
>> I don't actually need this checkpointed data to live beyond the life of
>> the job, however. I'm running jobs on AWS EMR (so on YARN + HDFS) and
>> reading and writing non-transient data to S3.
>>
>> Two questions:
>>
>> 1. Is there a Spark --conf option to set the checkpoint directory?
>> Somehow I couldn't find it, but surely it exists.
>>
>> 2. What's a good checkpoint directory for this use case? I imagine it'd
>> be on HDFS and presumably in a YARN application-specific temporary path
>> that gets cleaned up afterwards. Does anyone have a recommendation?
>>
>> Thanks!
>>
>> - Everett
>>
>>
>

Re: Temp checkpoint directory for EMR (S3 or HDFS)

Reply via email to