Aeden, this is probably happening because you are using the Hadoop
implementation of the S3 filesystem.

The Hadoop S3 filesystem tries to imitate a filesystem on top of S3. In so
doing it makes a lot of HEAD requests. These are expensive, and they
violate read-after-create visibility, which is what you seem to be
experiencing. By contrast, the Presto S3 implementation doesn't do the same
(harmful in this case) magic, and simply does PUT/GET operations. Because
that's all Flink needs for checkpointing, this works much better.
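
If you want to try the Presto implementation, a minimal sketch of the
change in flink-conf.yaml might look like the following. It assumes the
flink-s3-fs-presto plugin is installed in your plugins directory, and it
reuses the placeholders from your log; substitute your real bucket and
path. The s3p:// scheme selects the Presto filesystem explicitly, which
matters if the Hadoop plugin is also installed.

```yaml
# Assumption: flink-s3-fs-presto is present under plugins/ in the Flink distribution.
# s3p:// routes checkpoint I/O through the Presto S3 filesystem
# (s3a:// would select the Hadoop one instead).
state.checkpoints.dir: s3p://[MY-BUCKET]/[MY_JOB]/checkpoints
```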

Best,
David

On Thu, May 12, 2022 at 1:53 AM Aeden Jameson <aeden.jame...@gmail.com>
wrote:

> We're using S3 to store checkpoints. They are taken every minute. I'm
> seeing a large number of 404 responses from S3 being generated by the
> job manager. The order of the entries in the debugging log would imply
> that it's a result of a HEAD request to a key. For example all the
> incidents look like this,
>
>
> 2022-05-11 23:29:00,804 DEBUG com.amazonaws.request [] - Sending
> Request: HEAD https://[MY-BUCKET].s3.amazonaws.com
> /[MY_JOB]/checkpoints/5f4d6923883a1702b206f978fa3637a3/ Headers:
> (amz-sdk-invocation-id: XXXXX, Content-Type: application/octet-stream,
> User-Agent: Hadoop 3.1.0, aws-sdk-java/1.11.788
> Linux/5.4.181-99.354.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.13+8
> java/11.0.13 scala/2.12.7 vendor/Oracle_Corporation, )
>
> 2022-05-11 23:29:00,815 DEBUG com.amazonaws.request [] - Received
> error response: com.amazonaws.services.s3.model.AmazonS3Exception: Not
> Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not
> Found; ......)
>
> The key does in fact exist. How can I go about resolving this?
>
> --
> Cheers,
> Aeden
>
> GitHub: https://github.com/aedenj
>
