On 27 Jan 2017, at 23:17, VND Tremblay, Paul <tremblay.p...@bcg.com> wrote:
> Not sure what you mean by "a consistency layer on top." Any explanation would be greatly appreciated!
>
> Paul

Netflix's s3mper: https://github.com/Netflix/s3mper
EMR consistency: http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
AWS S3: S3Guard (WIP): https://issues.apache.org/jira/browse/HADOOP-13345

All of these do the same thing: use Amazon DynamoDB to store all the metadata, guaranteeing that every client gets a consistent view of deletes and adds, and that the listings returned match the state of the system. Otherwise list commands tend to lag changes, meaning deleted files are still mistakenly considered to be there, and lists of paths can miss newly created files. That means the commit-by-rename protocol used in Hadoop's FileOutputFormat may miss files to rename, and so lose results.

S3Guard will guarantee that listing is consistent, and will be a precursor to the 0-rename committer I'm working on, which needs that consistent list to find the .pending files listing outstanding operations to commit.
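
To make the failure mode concrete, here is a toy sketch in Python. It is not how s3mper, EMR or S3Guard are implemented: LaggyStore, ConsistentView and commit_by_rename are hypothetical names invented for this illustration, a Python set stands in for the DynamoDB table, and the "lag" is modelled crudely as a listing that only catches up when settle() is called. It only shows why a stale listing makes a rename-based commit drop output, and why answering listings from a separately maintained metadata table avoids that.

```python
class LaggyStore:
    """Toy object store: reads by key are immediate, listings lag behind writes."""

    def __init__(self):
        self.objects = {}    # key -> data: the true state of the store
        self.listed = set()  # keys visible to list(); refreshed only by settle()

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]

    def delete(self, key):
        self.objects.pop(key, None)

    def list(self, prefix):
        return sorted(k for k in self.listed if k.startswith(prefix))

    def settle(self):
        """Simulate the listing eventually catching up with the true state."""
        self.listed = set(self.objects)


class ConsistentView:
    """Toy consistency layer: every add and delete is also recorded in a
    separate table (a set here, standing in for DynamoDB), and listings are
    answered from that table rather than from the store's own listing."""

    def __init__(self, store):
        self.store = store
        self.table = set()

    def put(self, key, data):
        self.store.put(key, data)
        self.table.add(key)

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.delete(key)
        self.table.discard(key)

    def list(self, prefix):
        return sorted(k for k in self.table if k.startswith(prefix))


def commit_by_rename(fs, src, dest):
    """Commit by 'renaming' (copy then delete) every key the listing returns."""
    renamed = []
    for key in fs.list(src):
        new_key = dest + key[len(src):]
        fs.put(new_key, fs.get(key))
        fs.delete(key)
        renamed.append(new_key)
    return renamed


# Without a consistency layer: the second file is written, but not yet listed.
raw = LaggyStore()
raw.put("job/_temporary/part-0000", b"a")
raw.settle()
raw.put("job/_temporary/part-0001", b"b")
lost_run = commit_by_rename(raw, "job/_temporary/", "job/")

# With the layer: the listing comes from the metadata table, so nothing is missed.
guarded = ConsistentView(LaggyStore())
guarded.put("job/_temporary/part-0000", b"a")
guarded.put("job/_temporary/part-0001", b"b")
safe_run = commit_by_rename(guarded, "job/_temporary/", "job/")
```

In the first run only part-0000 is renamed and part-0001 is left behind under _temporary, which is the "lose results" case; in the second run both files are committed.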