My naïve assumption is that specifying a lifecycle policy for _spark_metadata
with a longer retention period will solve the issue.
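
Something along these lines is what I have in mind (a rough sketch using the
AWS SDK v2 for Java from Scala; the bucket name, prefixes and retention values
below are placeholders, not taken from your setup):

  import software.amazon.awssdk.services.s3.S3Client
  import software.amazon.awssdk.services.s3.model._

  val s3 = S3Client.create()

  // Hypothetical rule: keep objects under the _spark_metadata prefix much
  // longer than the rest of the output data.
  val metadataRule = LifecycleRule.builder()
    .id("retain-spark-metadata")
    .filter(LifecycleRuleFilter.builder().prefix("output/_spark_metadata/").build())
    .expiration(LifecycleExpiration.builder().days(365).build())
    .status(ExpirationStatus.ENABLED)
    .build()

  // Note: this call replaces the bucket's whole lifecycle configuration, so
  // any existing expiration rules would have to be included here as well.
  s3.putBucketLifecycleConfiguration(
    PutBucketLifecycleConfigurationRequest.builder()
      .bucket("my-bucket")
      .lifecycleConfiguration(
        BucketLifecycleConfiguration.builder().rules(metadataRule).build())
      .build())

How such a rule interacts with a shorter expiration rule on the parent prefix
is something I have not verified.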

Best regards

> On 13 Apr 2023, at 11:52, Yuval Itzchakov <yuva...@gmail.com> wrote:
> 
> 
> Hi everyone,
> 
> I am using Spark's FileStreamSink in order to write files to S3. On the S3
> bucket, I have a lifecycle policy that deletes data older than X days so that
> the bucket does not grow indefinitely. My problem starts with Spark jobs that
> don't receive data frequently. What happens in this case is that no new
> batches are created, which in turn means no new checkpoints are written to
> the output path and the _spark_metadata file is never overwritten, so the
> lifecycle policy eventually deletes it and the job fails.
> 
> As far as I can tell from reading the code and looking at StackOverflow
> answers, the _spark_metadata path is hardcoded to the base path of the
> output directory created by the DataStreamWriter, which means I cannot store
> this file under a separate prefix that is not covered by the lifecycle policy
> rule.
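> 
> For reference, this is roughly the kind of job I mean (a minimal sketch; the
> paths, source and trigger interval are made up):
> 
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.streaming.Trigger
> 
>   val spark = SparkSession.builder().appName("file-sink-example").getOrCreate()
> 
>   // Placeholder source; in practice this is whatever stream feeds the job.
>   val events = spark.readStream.format("rate").load()
> 
>   events.writeStream
>     .format("parquet")  // FileStreamSink
>     .option("checkpointLocation", "s3://my-bucket/checkpoints/job")
>     .option("path", "s3://my-bucket/output")
>     .trigger(Trigger.ProcessingTime("1 minute"))
>     .start()
> 
> The sink then writes its batch log under s3://my-bucket/output/_spark_metadata,
> i.e. always under the output path itself.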
> 
> Has anyone run into a similar problem?
> 
> 
> 
> -- 
> Best Regards,
> Yuval Itzchakov.
