On 4 Sep 2016, at 18:05, Everett Anderson <ever...@nuna.com> wrote:

> My impression from reading your various other replies on S3A is that it's also best to use mapreduce.fileoutputcommitter.algorithm.version=2 (which might someday be the default <https://issues.apache.org/jira/browse/MAPREDUCE-6336>) and,

For now, yes. There's work under way by various people to implement consistency and cache performance: S3Guard, https://issues.apache.org/jira/browse/HADOOP-13345 . That will need to come with a new commit algorithm which works with it and with other object stores with similar semantics (Azure WASB). I want an O(1) commit there, with a very small (1).

> presumably if your data fits well in memory, use fs.s3a.fast.upload=true. Is that right?

As of last week: no. Having written a test to upload multi-GB files generated at the speed of memory copies, I think that breaks down at scale. If you are generating data faster than it can be uploaded, you will OOM. With small datasets running in EC2 on large instances, or in installations where you have a local object store supporting the S3 API, you should get away with it. Bulk uploads over long-haul networks: no. Keep an eye on https://issues.apache.org/jira/browse/HADOOP-13560
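
To put rough numbers on the "generating data faster than it can be uploaded" case, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simplified model in which everything not yet uploaded sits in memory, so the backlog grows at (generation rate minus upload rate). The function name, the rates and the memory budget are all hypothetical, picked for illustration; none of them are measurements of S3A.

```python
def seconds_until_buffer_full(gen_mb_per_s, upload_mb_per_s, buffer_budget_mb):
    """Estimate how long until the in-memory backlog exceeds the budget.

    Simplified model (an assumption, not how S3A is implemented): data that
    has been generated but not yet uploaded is held in memory, so the
    backlog grows at (generation rate - upload rate).
    Returns None when uploads keep pace and the backlog never grows.
    """
    growth_mb_per_s = gen_mb_per_s - upload_mb_per_s
    if growth_mb_per_s <= 0:
        return None
    return buffer_budget_mb / growth_mb_per_s


# Hypothetical scenarios: (generation MB/s, upload MB/s, memory budget MB).
scenarios = {
    "bulk upload over a long-haul link": (500.0, 10.0, 2048.0),
    "in-EC2 on a large instance": (50.0, 100.0, 2048.0),
}

for name, (gen, up, budget) in scenarios.items():
    t = seconds_until_buffer_full(gen, up, budget)
    if t is None:
        print(f"{name}: uploads keep pace, backlog does not grow")
    else:
        print(f"{name}: budget exhausted after about {t:.1f} s")
```

Under this model the only thing that matters is whether the upload rate can match the generation rate; a bigger heap just delays the failure.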