On Thu, Oct 1, 2015 at 9:28 PM, Josh Wills <[email protected]> wrote:
> So that approach (hacky as it is) will work, and is really the only
> obvious way that the planner can know which PCollections should be kept
> around and which ones are okay to delete. I would expect it to work
> indefinitely in future versions, and I'm always open to API enhancements
> that make this sort of logic easier to express.
>
Two more questions --
1) In general in the Crunch programming model, should references to
collections remain viable across calls to run()?
2) How does this solution relate to something like
   table.cache(CachingOptions.builder().useDisk(true).build())?
Using cache() seems natural here, but currently in MRPipeline I think
cache() has maybe three branches depending on the input table, and one of
them results in an intermediate output in the regular temp directory.
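
For concreteness, the usage I'd hope for is sketched below. This is only a
sketch: whether cached data actually survives cleanup() is exactly my
question, useDisk is the one CachingOptions knob I'm assuming, and
MyApp.class and inputPath are placeholders.

// Classes are the stock Crunch API: org.apache.crunch.{CachingOptions,
// FilterFn, PCollection, PTable, Pair, Pipeline} and
// org.apache.crunch.impl.mr.MRPipeline.
Pipeline pipeline = new MRPipeline(MyApp.class);
PCollection<String> lines = pipeline.readTextFile(inputPath);
PTable<String, Long> counts = lines.count();

// Ask Crunch to materialize the table on disk rather than recompute
// it, in the hope that the cached copy survives cleanup().
counts.cache(CachingOptions.builder().useDisk(true).build());

pipeline.run();
pipeline.cleanup(false);

// Keep deriving new collections from the cached table after cleanup().
PTable<String, Long> repeated = counts.filter(new FilterFn<Pair<String, Long>>() {
  @Override
  public boolean accept(Pair<String, Long> entry) {
    return entry.second() > 1L;
  }
});
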
>
> J
>
> On Thu, Oct 1, 2015 at 3:28 PM, Everett Anderson <[email protected]> wrote:
>
>> (Context: This is related to the 'LeaseExpiredExceptions and temp side
>> effect files' thread.)
>>
>> In particular, the workaround would mean that we'd keep using the same
>> PCollection/PTable references after a call to run()/cleanup(), which feels
>> weird.
>>
>> Example:
>>
>> PTable liveTable = ...
>> liveTable = liveTable.parallelDo(...);
>>
>> // Write the table somewhere we know won't get cleaned up,
>> // which changes its internal Target.
>> liveTable.write(To.sequenceFile(tempPath),
>>     Target.WriteMode.CHECKPOINT);
>>
>> // Call run() and cleanup() to flush old temporary data.
>> pipeline.run();
>> pipeline.cleanup(false);
>>
>> // Keep using liveTable since we know it'll work under the
>> // covers because its Target is a sequence file that wasn't
>> // cleaned up.
>> liveTable = liveTable.parallelDo(...);
>>
>> On Thu, Oct 1, 2015 at 10:54 AM, Jeff Quinn <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Our Crunch pipeline has suffered from ballooning HDFS usage that spikes
>>> during the course of the job. Our solution has been to call Pipeline.run()
>>> and Pipeline.cleanup() between the major operations, hoping to achieve
>>> periodic "garbage collection" of the temporary outputs produced during
>>> the course of the pipeline.
>>>
>>> The problem is that some PCollections from one operation need to be used
>>> as input to subsequent operations, and cleanup() seems to blow away ALL
>>> PCollections that have not been explicitly written to a target (from
>>> reading the source, it appears to simply delete the pipeline temp
>>> directory).
>>>
>>> Our workaround has been to explicitly call .write() on the PCollections
>>> we know we will need across calls to run()/cleanup(). This seems to work
>>> as far as I can tell, but it feels hacky. Is there a better or more
>>> supported way to handle this, and is this approach likely to fail in
>>> future Crunch versions?
>>>
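>>> To make the workaround concrete, it looks roughly like the sketch below
>>> (stageOneFn, stageTwoFn, and the paths are placeholders; the rest is the
>>> stock Crunch API as I understand it):
>>>
>>> PCollection<String> input = pipeline.readTextFile(inputPath);
>>> PCollection<String> intermediate =
>>>     input.parallelDo(stageOneFn, Writables.strings());
>>>
>>> // Write the intermediate explicitly so cleanup() won't delete it.
>>> intermediate.write(To.sequenceFile(stagePath), Target.WriteMode.OVERWRITE);
>>>
>>> // Flush the temporary outputs accumulated so far.
>>> pipeline.run();
>>> pipeline.cleanup(false);
>>>
>>> // Later stages keep using the same reference.
>>> PCollection<String> result =
>>>     intermediate.parallelDo(stageTwoFn, Writables.strings());
>>>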
>>> Thanks!
>>>
>>> Jeff
>>>
>>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>