Sounds good to me; either a PR or a patch on the JIRA works as a way to get it committed.

On Mon, Nov 23, 2015 at 11:16 PM Jeff Quinn <[email protected]> wrote:
> So far the only solution I can imagine is using a thread pool to make all the copy/delete requests. Reading this relevant blog post gives me some confidence that a thread pool could work and not be horribly slow: http://shlomoswidler.com/2010/05/how-i-moved-5-of-all-objects-in-s3-with-jets3t.html
>
> If we wanted to submit a patch, what would be a good approach? I was thinking maybe org.apache.crunch.io.PathTarget#handleOutputs could return a ListenableFuture that executes the rename, and CrunchJobHooks.CompletionHook#handleMultiPaths could use a thread pool executor, with a configuration property controlling the number of threads.
>
> On Mon, Nov 23, 2015 at 7:47 PM, Josh Wills <[email protected]> wrote:
>
>> No, just moving to Slack from Cloudera; my data team is all of two people* right now, and a dedicated Hadoop ops person doesn't make sense yet.
>>
>> * But of course, I'm hiring. :)
>>
>> On Mon, Nov 23, 2015 at 6:43 PM Everett Anderson <[email protected]> wrote:
>>
>>> Josh, not to steal the thread, but I'm quite curious -- did something drive you to using S3 instead of HDFS?
>>>
>>> For me, I've been surprised how brittle HDFS seems out of the box in the face of even mild load. :( We've spent a lot of time turning knobs to make our data nodes stay responsive.
>>>
>>> On Mon, Nov 23, 2015 at 5:45 PM, Josh Wills <[email protected]> wrote:
>>>
>>>> (I don't know the answer to this, but as I also now run Crunch on top of S3, I'm interested in a solution.)
>>>>
>>>> On Mon, Nov 23, 2015 at 5:22 PM, Jeff Quinn <[email protected]> wrote:
>>>>
>>>>> Hey All,
>>>>>
>>>>> We have run into a pretty frustrating inefficiency inside of CrunchJobHooks.CompletionHook#handleMultiPaths.
>>>>>
>>>>> This method loops over all of the partial output files and moves them to their ultimate destination directories, calling org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) on each partial output in a loop.
>>>>>
>>>>> This is no problem when the org.apache.hadoop.fs.FileSystem in question is HDFS, where #rename is a cheap operation, but when an implementation such as S3NativeFileSystem is used it is extremely inefficient: each iteration through the loop makes a single blocking S3 API call, and the loop can be extremely long when there are many thousands of partial output files.
>>>>>
>>>>> Has anyone dealt with this before / have any ideas for a workaround?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Jeff
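For reference, a minimal sketch of the thread-pool approach Jeff describes, assuming a list of (source, destination) pairs like the ones handleMultiPaths iterates over. The class, method, and configuration property names here are hypothetical illustrations, not existing Crunch APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: move partial outputs to their final destinations using a fixed-size
 * thread pool, so the per-file copy/delete that backs FileSystem#rename on S3
 * runs concurrently instead of in one long blocking loop.
 */
public class ParallelOutputMover {

  // Hypothetical property name, not an existing Crunch configuration key.
  public static final String RENAME_POOL_SIZE = "crunch.output.rename.threads";

  public static void moveOutputs(Configuration conf,
                                 List<Path> sources,
                                 List<Path> destinations) throws Exception {
    int threads = conf.getInt(RENAME_POOL_SIZE, 10);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit one rename task per partial output file.
      List<Future<Boolean>> results = new ArrayList<>();
      for (int i = 0; i < sources.size(); i++) {
        final Path src = sources.get(i);
        final Path dst = destinations.get(i);
        results.add(pool.submit(() -> {
          FileSystem fs = src.getFileSystem(conf);
          return fs.rename(src, dst);
        }));
      }
      // Block until every rename has completed; surface the first failure.
      for (Future<Boolean> result : results) {
        if (!result.get()) {
          throw new IllegalStateException("Failed to rename a partial output");
        }
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

A plain ExecutorService with Futures is used here for brevity; the ListenableFuture variant suggested above would let handleOutputs hand back in-flight renames for the completion hook to collect, with the same bounded-pool idea controlling concurrency against the S3 API.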
