(I don't know the answer to this, but as I also now run Crunch on top of S3, I'm interested in a solution.)
On Mon, Nov 23, 2015 at 5:22 PM, Jeff Quinn <[email protected]> wrote:
> Hey All,
>
> We have run into a pretty frustrating inefficiency inside of
> CrunchJobHooks.CompletionHook#handleMultiPaths.
>
> This method loops over all of the partial output files and moves them to
> their ultimate destination directories, calling
> org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path,
> org.apache.hadoop.fs.Path) on each partial output in a loop.
>
> This is no problem when the org.apache.hadoop.fs.FileSystem in question
> is HDFS, where #rename is a cheap operation. But with an implementation
> such as S3NativeFileSystem it is extremely inefficient: each iteration
> through the loop makes a single blocking S3 API call, and the loop can be
> extremely long when there are many thousands of partial output files.
>
> Has anyone dealt with this before / have any ideas to work around it?
>
> Thanks!
>
> Jeff
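One direction worth exploring (just a sketch, not something I've tested against Crunch itself): since each rename is an independent blocking call, issuing them from a thread pool lets the per-call S3 latency overlap instead of serializing. The example below uses java.nio.file as a stand-in for the Hadoop FileSystem API, and the class name ParallelRename is my own invention; in the real hook you'd submit fs.rename(src, dst) calls instead of Files.move.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRename {

    // Move each source path to its destination concurrently, so that
    // slow per-call latency (as with S3 renames) overlaps across the
    // pool instead of adding up one call at a time.
    static void renameAll(Map<Path, Path> moves, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Map.Entry<Path, Path> e : moves.entrySet()) {
                futures.add(pool.submit(() -> {
                    try {
                        Files.move(e.getKey(), e.getValue());
                    } catch (IOException ex) {
                        throw new UncheckedIOException(ex);
                    }
                    return null;
                }));
            }
            // Block until every rename finishes, surfacing any failure.
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a set of partial output files to relocate.
        Path src = Files.createTempDirectory("src");
        Path dst = Files.createTempDirectory("dst");
        Map<Path, Path> moves = new HashMap<>();
        for (int i = 0; i < 8; i++) {
            Path p = Files.createFile(src.resolve("part-" + i));
            moves.put(p, dst.resolve("part-" + i));
        }
        renameAll(moves, 4);
        try (var listing = Files.list(dst)) {
            System.out.println(listing.count());
        }
    }
}
```

With many thousands of partial outputs, even a modest pool size should help substantially, since the bottleneck is S3 round-trip latency rather than local CPU. The caveat is error handling: a partial failure leaves some files moved and some not, which is also true of the existing sequential loop, but a pooled version should make sure exceptions propagate (the Future.get() above) rather than being swallowed on a worker thread.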
