My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention period will solve the issue.
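For concreteness, something along these lines is what I had in mind: one rule that expires the regular output under the query's base prefix, and a second rule with a much longer retention scoped to the _spark_metadata prefix. The bucket name, prefixes and day counts below are placeholders, and I sketched it against the AWS SDK v2 for Java called from Scala; a JSON lifecycle configuration or the console would work just as well. I have not verified how S3 resolves overlapping prefix rules, so treat this as an untested sketch rather than a confirmed fix.

    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model._

    object LifecycleSketch {
      def main(args: Array[String]): Unit = {
        val s3 = S3Client.create()

        // Rule 1: expire regular output files under the query's output prefix
        // after 30 days (placeholder values).
        val expireData = LifecycleRule.builder()
          .id("expire-output-data")
          .filter(LifecycleRuleFilter.builder().prefix("output/").build())
          .expiration(LifecycleExpiration.builder().days(30).build())
          .status(ExpirationStatus.ENABLED)
          .build()

        // Rule 2: keep the sink's metadata log much longer, so an idle query
        // does not lose its _spark_metadata entries before it commits a new batch.
        val keepMetadata = LifecycleRule.builder()
          .id("retain-spark-metadata")
          .filter(LifecycleRuleFilter.builder().prefix("output/_spark_metadata/").build())
          .expiration(LifecycleExpiration.builder().days(365).build())
          .status(ExpirationStatus.ENABLED)
          .build()

        // Apply both rules to the bucket ("my-bucket" is a placeholder).
        s3.putBucketLifecycleConfiguration(
          PutBucketLifecycleConfigurationRequest.builder()
            .bucket("my-bucket")
            .lifecycleConfiguration(
              BucketLifecycleConfiguration.builder().rules(expireData, keepMetadata).build())
            .build())
      }
    }
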
Best regards

> On 13 Apr 2023, at 11:52, Yuval Itzchakov <yuva...@gmail.com> wrote:
>
> Hi everyone,
>
> I am using Spark's FileStreamSink in order to write files to S3. On the S3
> bucket, I have a lifecycle policy that deletes data older than X days from
> the bucket so that it does not grow indefinitely. My problem starts with
> Spark jobs that don't have frequent data. What happens in this case is that
> no new batches are created, which in turn means no new checkpoints are
> written to the output path and no overwrites of the _spark_metadata file
> are performed, so the lifecycle policy eventually deletes the file, which
> causes the job to fail.
>
> As far as I can tell from reading the code and looking at StackOverflow
> answers, the _spark_metadata file path is hardcoded to the base path of the
> output directory created by the DataStreamWriter, which means I cannot store
> this file in a separate prefix that is not under the lifecycle policy rule.
>
> Has anyone run into a similar problem?
>
> --
> Best Regards,
> Yuval Itzchakov.
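For reference, a minimal sketch of the kind of setup described above. The bucket, prefixes, source and trigger interval are made up, but it shows where the metadata log ends up: FileStreamSink keeps it under the output path itself (here s3a://my-bucket/output/_spark_metadata), which is why it cannot simply be moved to a prefix outside the lifecycle rule.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object FileSinkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("file-sink-sketch").getOrCreate()

        // Placeholder source; any streaming DataFrame would do.
        val events = spark.readStream
          .format("rate")
          .option("rowsPerSecond", "1")
          .load()

        // FileStreamSink: each committed micro-batch is recorded in the metadata log
        // at <output path>/_spark_metadata. If no new batch is committed for a long
        // time, nothing under that prefix is rewritten, so an expiration rule on the
        // output prefix can delete the log out from under an idle query.
        val query = events.writeStream
          .format("parquet")
          .option("path", "s3a://my-bucket/output/")
          .option("checkpointLocation", "s3a://my-bucket/checkpoints/file-sink-sketch/")
          .trigger(Trigger.ProcessingTime("1 minute"))
          .start()

        query.awaitTermination()
      }
    }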