Hello,

Our Crunch pipeline has suffered from ballooning HDFS usage that spikes over the course of a job. Our solution has been to call Pipeline.run() and Pipeline.cleanup() between the major operations, hoping to get a periodic "garbage collection" of the temporary outputs produced along the way.
The problem is that some PCollections from one operation are needed as input to subsequent operations, and cleanup() seems to blow away ALL PCollections that have not been explicitly written to a target (from reading the source, it appears to simply delete the pipeline temp directory).

Our workaround has been to explicitly call .write() on the PCollections we know we will need across calls to run()/cleanup(). As far as I can tell this works, but it feels hacky. Is there a better or more supported way to handle this, and is this approach likely to break in future Crunch versions?
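Concretely, here is roughly the shape of what we're doing. This is a minimal sketch, not our real job: the class name, paths, and DoFn bodies are made-up placeholders, and we pass false to cleanup() since the Pipeline interface on our version takes a force flag.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Target;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class CheckpointedJob {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CheckpointedJob.class, new Configuration());
    PCollection<String> lines = pipeline.readTextFile("/data/input");

    // Stage 1: an expensive transformation whose output we need again later.
    PCollection<String> stage1 = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input.toLowerCase()); // placeholder logic
      }
    }, Writables.strings());

    // The workaround: write stage1 to a durable target so its bytes live
    // outside the pipeline temp directory that cleanup() deletes.
    stage1.write(To.textFile("/checkpoints/stage1"), Target.WriteMode.OVERWRITE);

    // Materialize everything planned so far, then reclaim HDFS space.
    pipeline.run();
    pipeline.cleanup(false);

    // Stage 2: stage1 is only safe to reuse here because of the explicit
    // write above; un-written PCollections were deleted with the temp dir.
    PCollection<String> stage2 = stage1.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input.trim()); // placeholder logic
      }
    }, Writables.strings());

    pipeline.writeTextFile(stage2, "/data/output");
    pipeline.done();
  }
}

Without that explicit write, the run()/cleanup() pair reclaims the space but also deletes the materialized bytes backing stage1, so the second parallelDo has nothing left to read from.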
Thanks!
Jeff