Sounds good to me; either a PR or a patch on the JIRA works as a way to get it committed.

On Mon, Nov 23, 2015 at 11:16 PM Jeff Quinn <[email protected]> wrote:
> So far the only solution I can imagine is using a thread pool to make all the copy/delete requests. Reading this relevant blog post gives me some confidence that a thread pool could work and not be horribly slow: http://shlomoswidler.com/2010/05/how-i-moved-5-of-all-objects-in-s3-with-jets3t.html
>
> If we wanted to submit a patch, what would be a good approach? I was thinking maybe org.apache.crunch.io.PathTarget#handleOutputs could return a ListenableFuture that executes the rename, and CrunchJobHooks.CompletionHook#handleMultiPaths could use a thread pool executor, with a configuration property controlling the number of threads.
>
> On Mon, Nov 23, 2015 at 7:47 PM, Josh Wills <[email protected]> wrote:
>
>> No, just moving to Slack from Cloudera; my data team is all of two people* right now, and a dedicated Hadoop ops person doesn't make sense yet.
>>
>> * But of course, I'm hiring. :)
>>
>> On Mon, Nov 23, 2015 at 6:43 PM Everett Anderson <[email protected]> wrote:
>>
>>> Josh, not to steal the thread, but I'm quite curious -- did something drive you to using S3 instead of HDFS?
>>>
>>> For me, I've been surprised how brittle HDFS seems out of the box in the face of even mild load. :( We've spent a lot of time turning knobs to make our data nodes stay responsive.
>>>
>>> On Mon, Nov 23, 2015 at 5:45 PM, Josh Wills <[email protected]> wrote:
>>>
>>>> (I don't know the answer to this, but as I also now run Crunch on top of S3, I'm interested in a solution.)
>>>>
>>>> On Mon, Nov 23, 2015 at 5:22 PM, Jeff Quinn <[email protected]> wrote:
>>>>
>>>>> Hey All,
>>>>>
>>>>> We have run into a pretty frustrating inefficiency inside of CrunchJobHooks.CompletionHook#handleMultiPaths.
>>>>>
>>>>> This method loops over all of the partial output files and moves them to their ultimate destination directories, calling org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) on each partial output in a loop.
>>>>>
>>>>> This is no problem when the org.apache.hadoop.fs.FileSystem in question is HDFS, where #rename is a cheap operation, but when an implementation such as S3NativeFileSystem is used it is extremely inefficient: each iteration through the loop makes a single blocking S3 API call, and the loop can be extremely long when there are many thousands of partial output files.
>>>>>
>>>>> Has anyone dealt with this before / have any ideas for a workaround?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Jeff
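For reference, a minimal sketch of the thread-pool approach Jeff describes, assuming a list of (source, destination) pairs like the ones handleMultiPaths iterates over. The class, method, and configuration property names here are hypothetical illustrations, not existing Crunch APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: move partial outputs to their final destinations using a fixed-size
 * thread pool, so the per-file copy/delete that backs FileSystem#rename on S3
 * runs concurrently instead of in one long blocking loop.
 */
public class ParallelOutputMover {

  // Hypothetical property name, not an existing Crunch configuration key.
  public static final String RENAME_POOL_SIZE = "crunch.output.rename.threads";

  public static void moveOutputs(Configuration conf,
                                 List<Path> sources,
                                 List<Path> destinations) throws Exception {
    int threads = conf.getInt(RENAME_POOL_SIZE, 10);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      // Submit one rename task per partial output file.
      List<Future<Boolean>> results = new ArrayList<>();
      for (int i = 0; i < sources.size(); i++) {
        final Path src = sources.get(i);
        final Path dst = destinations.get(i);
        results.add(pool.submit(() -> {
          FileSystem fs = src.getFileSystem(conf);
          return fs.rename(src, dst);
        }));
      }
      // Block until every rename has completed; surface the first failure.
      for (Future<Boolean> result : results) {
        if (!result.get()) {
          throw new IllegalStateException("Failed to rename a partial output");
        }
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

A plain ExecutorService with Futures is used here for brevity; the ListenableFuture variant suggested above would let handleOutputs hand back in-flight renames for the completion hook to collect, with the same bounded-pool idea controlling concurrency against the S3 API.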
