On 27 Jan 2017, at 23:17, VND Tremblay, Paul <tremblay.p...@bcg.com> wrote:

Not sure what you mean by "a consistency layer on top." Any explanation would 
be greatly appreciated!

Paul



Netflix's s3mper: https://github.com/Netflix/s3mper

EMR consistency: 
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html

AWS S3: S3Guard (WIP): https://issues.apache.org/jira/browse/HADOOP-13345

All of these do the same thing: use Amazon DynamoDB to store all the 
metadata, guaranteeing that every client gets a consistent view of deletes 
and adds, and that the listings returned match the state of the system. 
Otherwise list commands tend to lag changes, meaning deleted files are still 
mistakenly considered to be there, and lists of paths can miss newly created 
files. That means there's no guarantee that the commit-by-rename protocol used 
in Hadoop's FileOutputFormat will find every file it has to rename, so it can 
lose results.
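
To make the failure mode concrete, here is a toy Python sketch. It is not how 
s3mper, EMR or S3Guard are implemented: LaggingStore, ConsistentView and 
commit_by_rename are hypothetical names, a Python set stands in for the 
DynamoDB table, and "rename" is modelled as copy-then-delete. It only shows why 
a listing that lags behind writes makes a rename-based commit drop files, and 
how a separate metadata record avoids that.

```python
class LaggingStore:
    """Toy object store whose listing lags behind puts and deletes."""

    def __init__(self):
        self.objects = {}    # the real data: path -> bytes
        self.listed = set()  # what a list call currently reports

    def put(self, path, data):
        self.objects[path] = data  # stored, but not yet visible to list()

    def get(self, path):
        return self.objects[path]

    def delete(self, path):
        del self.objects[path]  # list() may keep reporting the path

    def list(self, prefix):
        return sorted(p for p in self.listed if p.startswith(prefix))

    def settle(self):
        """Simulate the listing eventually catching up with the data."""
        self.listed = set(self.objects)


class ConsistentView:
    """Wrapper that records every add/delete in its own metadata table.

    The set is a stand-in for the DynamoDB table (an assumption made for
    this sketch); listings are answered from it, not from the store.
    """

    def __init__(self, store):
        self.store = store
        self.table = set()

    def put(self, path, data):
        self.store.put(path, data)
        self.table.add(path)

    def get(self, path):
        return self.store.get(path)

    def delete(self, path):
        self.store.delete(path)
        self.table.discard(path)

    def list(self, prefix):
        return sorted(p for p in self.table if p.startswith(prefix))


def commit_by_rename(fs, src, dst):
    """Move every file that list() reports under src to dst."""
    renamed = []
    for path in fs.list(src):
        target = dst + path[len(src):]
        fs.put(target, fs.get(path))  # "rename" = copy ...
        fs.delete(path)               # ... then delete the source
        renamed.append(target)
    return renamed


def run_demo(fs):
    """Write two task outputs, then commit them straight away."""
    fs.put("tmp/task1/part-0000", b"a")
    fs.put("tmp/task1/part-0001", b"b")
    return commit_by_rename(fs, "tmp/task1/", "out/")


raw_result = run_demo(LaggingStore())                      # listing has not caught up
guarded_result = run_demo(ConsistentView(LaggingStore()))  # listing from metadata table

print("without consistency layer:", raw_result)
print("with consistency layer:   ", guarded_result)
```

Against the bare store the commit sees an empty listing and renames nothing, 
so both output files are silently left behind; with the wrapper, both files 
are found and moved.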

S3Guard will guarantee that listing is consistent, and will be a precursor to 
the 0-rename committer I'm working on, which needs that consistent list to find 
the .pending files that list the outstanding operations to commit.

