On Thu, Oct 1, 2015 at 9:28 PM, Josh Wills <[email protected]> wrote:
> So that approach (hacky as it is) will work, and is really the only
> obvious way that the planner can know which PCollections should be kept
> around and which ones are okay to delete. I would expect it to work
> indefinitely in future versions, and I'm always open to API enhancements
> that make this sort of logic easier to express.
>
Two more questions --
1) In general in the Crunch programming model, should references to
collections remain viable across calls to run()?
2) How does this solution relate to something like
   table.cache(CachingOptions.builder().useDisk(true).build())?
Using cache() seems natural here, but currently in MRPipeline I think
cache() has maybe three branches depending on the input table, and one of
them results in an intermediate output in the regular temp directory.
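
For concreteness, the usage I'd hope for is sketched below. This is only a
sketch: whether cached data actually survives cleanup() is exactly my
question, useDisk is the one CachingOptions knob I'm assuming, and
MyApp.class and inputPath are placeholders.

// Classes are the stock Crunch API: org.apache.crunch.{CachingOptions,
// FilterFn, PCollection, PTable, Pair, Pipeline} and
// org.apache.crunch.impl.mr.MRPipeline.
Pipeline pipeline = new MRPipeline(MyApp.class);
PCollection<String> lines = pipeline.readTextFile(inputPath);
PTable<String, Long> counts = lines.count();

// Ask Crunch to materialize the table on disk rather than recompute
// it, in the hope that the cached copy survives cleanup().
counts.cache(CachingOptions.builder().useDisk(true).build());

pipeline.run();
pipeline.cleanup(false);

// Keep deriving new collections from the cached table after cleanup().
PTable<String, Long> repeated = counts.filter(new FilterFn<Pair<String, Long>>() {
  @Override
  public boolean accept(Pair<String, Long> entry) {
    return entry.second() > 1L;
  }
});
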
>
> J
>
> On Thu, Oct 1, 2015 at 3:28 PM, Everett Anderson <[email protected]> wrote:
>
>> (Context: This is related to the 'LeaseExpiredExceptions and temp side
>> effect files' thread.)
>>
>> In particular, the workaround would mean that we'd keep using the same
>> PCollection/PTable references after a call to run()/cleanup(), which feels
>> weird.
>>
>> Example:
>>
>> PTable liveTable = ...
>> liveTable = liveTable.parallelDo(...);
>>
>> // Write the table somewhere we know won't get cleaned up,
>> // which changes its internal Target.
>> liveTable.write(To.sequenceFile(tempPath),
>>     Target.WriteMode.CHECKPOINT);
>>
>> // Call run() and cleanup() to flush old temporary data.
>> pipeline.run();
>> pipeline.cleanup(false);
>>
>> // Keep using liveTable since we know it'll work under the
>> // covers because its Target is a sequence file that wasn't
>> // cleaned up.
>> liveTable = liveTable.parallelDo(...);
>>
>> On Thu, Oct 1, 2015 at 10:54 AM, Jeff Quinn <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> Our Crunch pipeline has suffered from ballooning HDFS usage that spikes
>>> during the course of the job. Our solution has been to call Pipeline.run()
>>> and Pipeline.cleanup() between the major operations, hoping to achieve
>>> periodic "garbage collection" of the temporary outputs produced during
>>> the course of the pipeline.
>>>
>>> The problem is that some PCollections from one operation need to be used
>>> as input to subsequent operations, and cleanup() seems to blow away ALL
>>> PCollections that have not been explicitly written to a target (from
>>> reading the source, it appears to simply delete the pipeline temp
>>> directory).
>>>
>>> Our workaround has been to explicitly call .write() on the PCollections
>>> we know we will need across calls to run()/cleanup(). This seems to work
>>> as far as I can tell, but it feels hacky. Is there a better or more
>>> supported way to handle this, and is this approach likely to fail in
>>> future Crunch versions?
>>>
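>>> To make the workaround concrete, it looks roughly like the sketch below
>>> (stageOneFn, stageTwoFn, and the paths are placeholders; the rest is the
>>> stock Crunch API as I understand it):
>>>
>>> PCollection<String> input = pipeline.readTextFile(inputPath);
>>> PCollection<String> intermediate =
>>>     input.parallelDo(stageOneFn, Writables.strings());
>>>
>>> // Write the intermediate explicitly so cleanup() won't delete it.
>>> intermediate.write(To.sequenceFile(stagePath), Target.WriteMode.OVERWRITE);
>>>
>>> // Flush the temporary outputs accumulated so far.
>>> pipeline.run();
>>> pipeline.cleanup(false);
>>>
>>> // Later stages keep using the same reference.
>>> PCollection<String> result =
>>>     intermediate.parallelDo(stageTwoFn, Writables.strings());
>>>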
>>> Thanks!
>>>
>>> Jeff
>>>
>>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>