We just use S3DistCp, and it works pretty well (we usually use EMR, but DistCp does the same in our Cloudera cluster).
On Tue, Nov 24, 2015 at 9:46 AM Josh Wills <[email protected]> wrote:

> Sounds good to me; either a PR or a patch on the JIRA works as a way to get
> it committed.
>
> On Mon, Nov 23, 2015 at 11:16 PM Jeff Quinn <[email protected]> wrote:
>
>> So far the only solution I can imagine is using a thread pool to make all
>> the copy/delete requests. Reading this relevant blog post gives me some
>> confidence that a thread pool could work and not be horribly slow:
>> http://shlomoswidler.com/2010/05/how-i-moved-5-of-all-objects-in-s3-with-jets3t.html
>>
>> If we wanted to submit a patch, what would be a good approach? I was
>> thinking maybe org.apache.crunch.io.PathTarget#handleOutputs could return
>> a ListenableFuture that executes the rename, and
>> CrunchJobHooks.CompletionHook#handleMultiPaths could use a thread pool
>> executor, with a configuration property controlling the number of threads.
>>
>> On Mon, Nov 23, 2015 at 7:47 PM, Josh Wills <[email protected]> wrote:
>>
>>> No, just moving to Slack from Cloudera, my data team is all of two
>>> people* right now, and a dedicated Hadoop ops person doesn't make sense
>>> yet.
>>>
>>> * But of course, I'm hiring. :)
>>>
>>> On Mon, Nov 23, 2015 at 6:43 PM Everett Anderson <[email protected]>
>>> wrote:
>>>
>>>> Josh, not to steal the thread, but I'm quite curious -- did something
>>>> drive you to using S3 instead of HDFS?
>>>>
>>>> For me, I've been surprised how brittle HDFS seems out of the box in
>>>> the face of even mild load. :( We've spent a lot of time turning knobs
>>>> to make our data nodes stay responsive.
>>>>
>>>> On Mon, Nov 23, 2015 at 5:45 PM, Josh Wills <[email protected]>
>>>> wrote:
>>>>
>>>>> (I don't know the answer to this, but as I also now run Crunch on top
>>>>> of S3, I'm interested in a solution.)
>>>>>
>>>>> On Mon, Nov 23, 2015 at 5:22 PM, Jeff Quinn <[email protected]> wrote:
>>>>>
>>>>>> Hey All,
>>>>>>
>>>>>> We have run into a pretty frustrating inefficiency inside of
>>>>>> CrunchJobHooks.CompletionHook#handleMultiPaths.
>>>>>>
>>>>>> This method loops over all of the partial output files and moves them
>>>>>> to their ultimate destination directories, calling
>>>>>> org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path,
>>>>>> org.apache.hadoop.fs.Path) on each partial output in a loop.
>>>>>>
>>>>>> This is no problem when the org.apache.hadoop.fs.FileSystem in
>>>>>> question is HDFS, where #rename is a cheap operation, but when an
>>>>>> implementation such as S3NativeFileSystem is used it is extremely
>>>>>> inefficient: each iteration through the loop makes a single blocking
>>>>>> S3 API call, and the loop can be extremely long when there are many
>>>>>> thousands of partial output files.
>>>>>>
>>>>>> Has anyone dealt with this before / have any ideas to work around it?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>>> may contain information that is confidential, proprietary in nature,
>>>>>> protected health information (PHI), or otherwise protected by law from
>>>>>> disclosure, and is solely for the use of the intended recipient(s). If
>>>>>> you are not the intended recipient, you are hereby notified that any
>>>>>> use, disclosure or copying of this email, including any attachments,
>>>>>> is unauthorized and strictly prohibited. If you have received this
>>>>>> email in error, please notify the sender of this email. Please delete
>>>>>> this and all copies of this email from your system. Any opinions
>>>>>> either expressed or implied in this email and all attachments, are
>>>>>> those of its author only, and do not necessarily reflect those of Nuna
>>>>>> Health, Inc.
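The thread-pool approach Jeff proposes above could look roughly like the following. This is a minimal, self-contained sketch, not actual Crunch or Hadoop code: `RenameOp` is a hypothetical stand-in for `org.apache.hadoop.fs.FileSystem#rename(Path, Path)`, and the pool size stands in for the configuration property he suggests. The idea is simply to submit every blocking rename to a fixed-size pool instead of issuing one call per loop iteration, then wait on all the futures so failures still surface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRename {

    // Hypothetical stand-in for a blocking FileSystem#rename call; in
    // Crunch this would be org.apache.hadoop.fs.FileSystem#rename.
    interface RenameOp {
        boolean rename(String src, String dst) throws Exception;
    }

    // Submits every rename to a fixed-size pool and waits for all of them,
    // rather than making one blocking S3 API call per loop iteration.
    // Returns the number of renames that reported success.
    static int renameAll(List<String[]> moves, RenameOp op, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String[] m : moves) {
                results.add(pool.submit(() -> op.rename(m[0], m[1])));
            }
            int ok = 0;
            for (Future<Boolean> f : results) {
                if (f.get()) {   // f.get() also rethrows any rename failure
                    ok++;
                }
            }
            return ok;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate many partial outputs with a slow, S3-like rename.
        List<String[]> moves = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            moves.add(new String[] {"part-" + i, "out/part-" + i});
        }
        int ok = renameAll(moves, (src, dst) -> {
            Thread.sleep(5);   // pretend each call is a blocking S3 request
            return true;
        }, 16);
        System.out.println(ok + " renames completed");
    }
}
```

With N threads the wall-clock cost drops from roughly (files x per-call latency) to about (files / N x per-call latency), which is the win the linked JetS3t post describes. A real patch would also need to decide how rename failures and retries propagate back to the job hook.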
