Re: S3A + EMR failure when writing Parquet?

Steve Loughran Mon, 05 Sep 2016 09:42:12 -0700

On 4 Sep 2016, at 18:05, Everett Anderson <ever...@nuna.com> wrote:


My impression from reading your various other replies on S3A is that it's also
best to use mapreduce.fileoutputcommitter.algorithm.version=2 (which might
someday be the default: https://issues.apache.org/jira/browse/MAPREDUCE-6336)
and,

For now, yes. There's work under way by various people to implement consistency
and improve cache performance: S3Guard,
https://issues.apache.org/jira/browse/HADOOP-13345. That'll need to come with
a new commit algorithm which works with it and with other object stores with
similar semantics (Azure WASB). I want an O(1) commit there, with a very small
constant behind that (1).
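
As a rough sketch of how the committer setting from the question could be
passed to a Spark job: Spark copies any property prefixed with "spark.hadoop."
into the Hadoop configuration, so the spark-submit flag can be generated as
below. The names HADOOP_SETTINGS and spark_conf_args are hypothetical, made up
for illustration; only the property name and value come from this thread.

```python
# Hadoop properties to hand to the job (property and value taken from the thread).
HADOOP_SETTINGS = {
    "mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def spark_conf_args(hadoop_settings):
    """Build spark-submit arguments for the given Hadoop properties.

    Each key gets the "spark.hadoop." prefix so that Spark forwards it
    to the Hadoop Configuration used by the output committer.
    """
    args = []
    for key, value in sorted(hadoop_settings.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints the flags to append to a spark-submit command line.
    print(" ".join(spark_conf_args(HADOOP_SETTINGS)))
```

The same dictionary could carry other Hadoop or S3A properties; it is kept to
the one setting that the reply above endorses.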

presumably if your data fits well in memory, use fs.s3a.fast.upload=true. Is 
that right?


as of last week: no.

Having written a test to upload multi-GB files generated at the speed of memory
copies, I don't think that holds at scale. If you are generating data faster
than it can be uploaded, you will OOM.
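
The OOM risk can be illustrated with a back-of-the-envelope model. This is only
a sketch, assuming that all data not yet uploaded sits in memory and that both
rates are constant; the function name and every number used are hypothetical,
not measurements from the test mentioned above.

```python
def seconds_until_buffer_full(generate_mb_s, upload_mb_s, buffer_mb):
    """Estimate how long an in-memory upload buffer lasts.

    Returns the number of seconds until buffer_mb is exhausted, or None
    when the upload keeps pace with generation and the backlog never grows.
    """
    backlog_rate = generate_mb_s - upload_mb_s  # MB of backlog added per second
    if backlog_rate <= 0:
        return None
    return buffer_mb / backlog_rate


if __name__ == "__main__":
    # Hypothetical case: data produced far faster than the uplink can drain it.
    print(seconds_until_buffer_full(generate_mb_s=500, upload_mb_s=50, buffer_mb=4096))
    # Hypothetical case: upload bandwidth exceeds the generation rate.
    print(seconds_until_buffer_full(generate_mb_s=40, upload_mb_s=100, buffer_mb=4096))
```

In this model the buffer size only delays the failure; whether the backlog
grows at all depends on the sign of the difference between the two rates.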


With small datasets running in EC2 on large instances, or in installations
where you have a local object store supporting the S3 API, you should get away
with it. Bulk uploads over long-haul networks: no.

Keep an eye on: https://issues.apache.org/jira/browse/HADOOP-13560
