Hey Ryan,

Thanks for the detailed writeup, and great job explaining the question and providing the links :)

With respect to renaming: Hudi avoids renaming metadata files altogether and instead creates immutable metadata files whose names encode the state of the commit. Generally, we believe some of the consistency solutions out there were written in the early days of S3, when its guarantees were not well established or understood. The S3 consistency guard in Hudi has been fairly battle-tested by the community in their production clusters for a while now. Are you seeing any specific issues in your setup?

Once again, thanks for your interest in Hudi.

Balaji.V

On Wednesday, August 12, 2020, 10:35:05 AM PDT, Ryan Murray <[email protected]> wrote:

Hey all,

I've been playing around with Hudi for a little while now. Really like it! Thanks for all the work :-)

I do have a question about S3 and consistency: how does Hudi get around eventual consistency in S3, particularly in the case of metadata files? I can see there is a ConsistencyGuard [1] which ensures that the JVM thread it's run in can see a path; however, it isn't clear to me that this would be valid across a system. If a writer 'A' performs an action which requires a rename, for example, how can we ensure that readers B and C see the newly renamed file? Or even that nodes across reader B (e.g. a Spark cluster) see the same file content? To me this is checking whether an object is visible from a particular thread rather than checking the eventual consistency restrictions of S3 [2]. People have gone to great lengths to get around S3's consistency issues as well [3][4].

Apologies if this is a naive question, I am still grappling with the Hudi commit model.

Best,
Ryan

[1] https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/fs/ConsistencyGuard.java
[2] https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
[3] https://github.com/Netflix/s3mper
[4] https://docs.delta.io/latest/delta-storage.html#amazon-s3
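
For anyone following the thread, the kind of per-caller visibility check being asked about can be sketched roughly as below. This is only an illustration of the general poll-with-backoff idea, not Hudi's actual ConsistencyGuard code: the function name `wait_till_visible`, its parameters, and the default values are all hypothetical, and `os.path.exists` stands in for whatever storage call a real system would use.

```python
import os
import time


def wait_till_visible(path, max_checks=5, initial_wait_s=0.01, max_wait_s=1.0,
                      exists=os.path.exists, sleep=time.sleep):
    """Poll until `path` is visible to this caller, backing off between checks.

    Hypothetical helper for illustration only. Returns True once the path is
    seen, False if it never appeared within `max_checks` attempts.
    """
    wait = initial_wait_s
    for attempt in range(max_checks):
        if exists(path):
            return True
        # Sleep only if another check will follow, doubling the wait each time
        # up to the configured ceiling.
        if attempt < max_checks - 1:
            sleep(wait)
            wait = min(wait * 2, max_wait_s)
    return False
```

Note that a True result only says the caller that ran the check saw the path; it says nothing about what other readers observe, which is exactly the concern raised in the question.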
