Aeden, this is probably happening because you are using the Hadoop implementation of S3.
The Hadoop S3 filesystem tries to imitate a filesystem on top of S3. In doing so, it makes a lot of HEAD requests. These are expensive, and they can undermine S3's read-after-create visibility, which seems to be what you are experiencing.

By contrast, the Presto S3 implementation doesn't do the same (in this case harmful) magic, and simply does PUT/GET operations. Because that's all Flink needs for checkpointing, this works much better.

Best,
David

On Thu, May 12, 2022 at 1:53 AM Aeden Jameson <aeden.jame...@gmail.com> wrote:

> We're using S3 to store checkpoints. They are taken every minute. I'm
> seeing a large number of 404 responses from S3 being generated by the
> job manager. The order of the entries in the debugging log would imply
> that it's a result of a HEAD request to a key. For example all the
> incidents look like this,
>
> > 2022-05-11 23:29:00,804 DEBUG com.amazonaws.request [] - Sending Request:
> > HEAD https://[MY-BUCKET].s3.amazonaws.com/[MY_JOB]/checkpoints/5f4d6923883a1702b206f978fa3637a3/
> > Headers: (amz-sdk-invocation-id: XXXXX, Content-Type: application/octet-stream,
> > User-Agent: Hadoop 3.1.0, aws-sdk-java/1.11.788 Linux/5.4.181-99.354.amzn2.x86_64
> > OpenJDK_64-Bit_Server_VM/11.0.13+8 java/11.0.13 scala/2.12.7 vendor/Oracle_Corporation, )
>
> > 2022-05-11 23:29:00,815 DEBUG com.amazonaws.request [] - Received error response:
> > com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3;
> > Status Code: 404; Error Code: 404 Not Found; ......)
>
> The key does in fact exist. How can I go about resolving this?
>
> --
> Cheers,
> Aeden
>
> GitHub: https://github.com/aedenj
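For reference, here is a minimal sketch of switching checkpoints to the Presto S3 implementation, written as a flink-conf.yaml fragment. It assumes a Flink 1.x release where the Presto filesystem ships as flink-s3-fs-presto-<version>.jar in opt/ and is enabled by copying it into its own directory under plugins/. It also assumes the config keys are state.checkpoints.dir and execution.checkpointing.interval, since newer releases rename some of these. The bucket and job path are the placeholders from the log in the question. The s3p:// scheme selects the Presto implementation explicitly, because both the Presto and the Hadoop plugins also register s3://.

```yaml
# flink-conf.yaml (sketch; assumes Flink 1.x key names)
#
# Enable the Presto S3 plugin first, from the Flink distribution directory:
#   mkdir ./plugins/s3-fs-presto
#   cp ./opt/flink-s3-fs-presto-*.jar ./plugins/s3-fs-presto/

# s3p:// is served only by the Presto plugin, so checkpoints bypass the Hadoop S3 code path.
state.checkpoints.dir: s3p://[MY-BUCKET]/[MY_JOB]/checkpoints

# Checkpoint once a minute, matching the interval described in the question.
execution.checkpointing.interval: 1min
```

Other S3 paths used by the job, such as a Hadoop-based s3a:// sink, can keep their existing scheme. Only the checkpoint location needs to move.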