Hello,

Our Crunch pipeline has suffered from ballooning HDFS usage that spikes over the course of a job. Our solution has been to call Pipeline.run() and Pipeline.cleanup() between the major operations, hoping to get a periodic "garbage collection" of the temporary outputs produced along the way.
The problem is that some PCollections from one operation are needed as input to subsequent operations, and cleanup() seems to blow away ALL PCollections that have not been explicitly written to a target (from reading the source, it appears to simply delete the pipeline temp directory).

Our workaround has been to explicitly call .write() on the PCollections we know we will need across calls to run()/cleanup(). As far as I can tell this works, but it feels hacky. Is there a better or more supported way to handle this, and is this approach likely to break in future Crunch versions?
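Concretely, here is roughly the shape of what we're doing. This is a minimal sketch, not our real job: the class name, paths, and DoFn bodies are made-up placeholders, and we pass false to cleanup() since the Pipeline interface on our version takes a force flag.

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Target;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.crunch.types.writable.Writables;
import org.apache.hadoop.conf.Configuration;

public class CheckpointedJob {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CheckpointedJob.class, new Configuration());
    PCollection<String> lines = pipeline.readTextFile("/data/input");

    // Stage 1: an expensive transformation whose output we need again later.
    PCollection<String> stage1 = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input.toLowerCase()); // placeholder logic
      }
    }, Writables.strings());

    // The workaround: write stage1 to a durable target so its bytes live
    // outside the pipeline temp directory that cleanup() deletes.
    stage1.write(To.textFile("/checkpoints/stage1"), Target.WriteMode.OVERWRITE);

    // Materialize everything planned so far, then reclaim HDFS space.
    pipeline.run();
    pipeline.cleanup(false);

    // Stage 2: stage1 is only safe to reuse here because of the explicit
    // write above; un-written PCollections were deleted with the temp dir.
    PCollection<String> stage2 = stage1.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input.trim()); // placeholder logic
      }
    }, Writables.strings());

    pipeline.writeTextFile(stage2, "/data/output");
    pipeline.done();
  }
}

Without that explicit write, the run()/cleanup() pair reclaims the space but also deletes the materialized bytes backing stage1, so the second parallelDo has nothing left to read from.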
Thanks!
Jeff