Aeden,

I want to expand my answer after having re-read your question a bit more
carefully.

For point 1, the behavior you are seeing is expected. With hadoop, the
metadata written by the job manager will literally include "_entropy_" in
its path, while the key will be replaced in the paths of all checkpoint
data files. With presto, the metadata path won't include "_entropy_" at all
(it disappears, rather than being replaced by something specific).
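
To make the three cases concrete, here is a rough sketch in Python. It is
not Flink's actual code: the function names are made up for illustration,
and the entropy length of 4 and the lowercase alphanumeric alphabet are
assumptions. It only mimics the path shapes described above, using the
checkpoint directory from your config.

```python
import secrets
import string

ENTROPY_KEY = "_entropy_"  # the value of s3.entropy.key from the thread
ALPHABET = string.ascii_lowercase + string.digits  # assumed alphabet


def data_file_path(path, key=ENTROPY_KEY, length=4):
    """Checkpoint data files: the key is replaced by random characters.

    The length of 4 is an assumption made for this sketch.
    """
    entropy = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return path.replace(key, entropy, 1)


def metadata_path_hadoop(path):
    """Metadata with hadoop, as described above: the key stays literally."""
    return path


def metadata_path_presto(path, key=ENTROPY_KEY):
    """Metadata with presto, as described above: the key disappears."""
    return path.replace(key + "/", "", 1)


base = "s3://my-bucket/_entropy_/my-job/checkpoints"
print(data_file_path(base))        # random segment in place of the key
print(metadata_path_hadoop(base))  # unchanged, "_entropy_" still present
print(metadata_path_presto(base))  # key segment removed
```

That is why, with hadoop, you end up with both a literal "_entropy_"
folder (metadata) and a bunch of folders with random names (data files).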

For point 2, I'm not sure.

David

On Thu, May 19, 2022 at 2:37 PM David Anderson <da...@nosredna.org> wrote:

> This sounds like it could be FLINK-17359 [1]. What version of Flink are
> you using?
>
> Another likely explanation arises from the fact that only the
> checkpoint data files (the ones created and written by the task managers)
> will have the _entropy_ replaced. The job manager does not inject entropy
> into the path of the checkpoint metadata, so that it remains at a
> predictable URI. Since Flink only writes keyed state larger than
> state.storage.fs.memory-threshold into the checkpoint data files, and only
> those files have entropy injected into their paths, if all of your state is
> small it will all end up in the metadata file and you don't see any entropy
> injection happening. See the comments on [2] for more on this.
>
> FWIW, I would urge you to use presto instead of hadoop for checkpointing
> on S3. The performance of the hadoop "filesystem" is problematic when it's
> used for checkpointing.
>
> Regards,
> David
>
> [1] https://issues.apache.org/jira/browse/FLINK-17359
> [2] https://issues.apache.org/jira/browse/FLINK-24878
>
> On Wed, May 18, 2022 at 7:48 PM Aeden Jameson <aeden.jame...@gmail.com>
> wrote:
>
>> I have checkpoints set up against s3 using the hadoop plugin. (I'll
>> migrate to presto at some point.) I've set up entropy injection per the
>> documentation with
>>
>> state.checkpoints.dir: s3://my-bucket/_entropy_/my-job/checkpoints
>> s3.entropy.key: _entropy_
>>
>> I'm seeing some behavior that I don't quite understand.
>>
>> 1. The folder s3://my-bucket/_entropy_/my-job/checkpoints/...
>> literally exists. Meaning that "_entropy_" has not been replaced. At
>> the same time there are also a bunch of folders where "_entropy_" has
>> been replaced. Is that to be expected? If so, would someone elaborate
>> on why this is happening?
>>
>> 2. Should the paths in the checkpoint history tab of the Flink UI
>> display the path with the key? With the current setup they do not.
>>
>> Thanks,
>> Aeden
>>
>> GitHub: https://github.com/aedenj
>> Linked In: http://www.linkedin.com/in/aedenjameson
>>
>
