On Saturday, October 3, 2015, Everett Anderson <[email protected]> wrote:
>
> On Thu, Oct 1, 2015 at 9:28 PM, Josh Wills <[email protected]> wrote:
>
>> So that approach (hacky as it is) will work, and is really the only
>> obvious way that the planner can know which PCollections should be kept
>> around and which ones are okay to delete. I would expect it to work
>> indefinitely in future versions, and I'm always open to API enhancements
>> that make this sort of logic easier to express.
>
> Two more questions --
>
> 1) In general in the Crunch programming model, should references to
> collections remain viable across calls to run()?

Yes, although some recomputation may happen depending on which PCollections
were materialized/written on the previous run.

> 2) How does this solution relate to something like
> table.cache(CachingOptions.builder().useDisk(true).build());
>
> ?
>
> Somehow using cache() seems natural, here, but currently in the MRPipeline
> I think cache() has maybe 3 branches depending on the input table, and one
> of them results in an intermediate output in the regular temp directory.

Yeah, cache() in MR is really a shorthand for materialize(). The
CachingOptions only kick in when there is some flexibility in the caching
mechanism (e.g., for Spark.)

>> J
>>
>> On Thu, Oct 1, 2015 at 3:28 PM, Everett Anderson <[email protected]> wrote:
>>
>>> (Context: This is related to the 'LeaseExpiredExceptions and temp side
>>> effect files' thread.)
>>>
>>> In particular, the workaround would mean that we'd keep using the same
>>> PCollection/PTable references after a call to run()/cleanup(), which feels
>>> weird.
>>>
>>> Example:
>>>
>>> PTable liveTable = ...
>>> liveTable = liveTable.parallelDo(...)
>>>
>>> // Write the table somewhere we know won't get cleaned up,
>>> // which changes its internal Target.
>>> liveTable.write(To.sequenceFile(tempPath),
>>>     Target.WriteMode.CHECKPOINT);
>>>
>>> // Call run() and cleanup() to flush old temporary data.
>>> pipeline.run();
>>> pipeline.cleanup(false);
>>>
>>> // Keep using liveTable since we know it'll work under the
>>> // covers because its Target is a sequence file that wasn't
>>> // cleaned up.
>>> liveTable = liveTable.parallelDo(...)
>>>
>>> On Thu, Oct 1, 2015 at 10:54 AM, Jeff Quinn <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Our crunch pipeline has suffered from ballooning HDFS usage which
>>>> spikes during the course of the job. Our solution has been to call
>>>> Pipeline.run() and Pipeline.cleanup() between the major operations, hoping
>>>> to achieve periodic "garbage collection" of the temporary outputs that are
>>>> produced during the course of the pipeline.
>>>>
>>>> The problem is some PCollections from one operation will need to be
>>>> used as input to subsequent operations, and cleanup() seems to blow away
>>>> ALL PCollections that have not been explicitly written to a target (from
>>>> reading the source, it seems to just blow away the pipeline temp
>>>> directory).
>>>>
>>>> Our workaround has been to explicitly call .write on the PCollections
>>>> we know we will need across calls to run()/cleanup(). This seems to work as
>>>> far as I can tell, but it feels hacky. Is there a better or more supported
>>>> way to handle this, and is this approach likely to fail in future crunch
>>>> versions?
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
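
For anyone who wants to see why the write-before-cleanup workaround holds up, the behaviour described in this thread can be modelled without Crunch at all. The following is only a toy sketch using the Java standard library (Java 11+). `ToyPipeline`, `writeIntermediate`, `writeCheckpoint`, and `newDir` are hypothetical names invented for illustration and are not part of the Crunch API. It assumes, as Jeff reports from reading the source, that cleanup() simply deletes the pipeline temp directory, so anything written outside that directory survives.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Toy stand-in for a pipeline's temp-directory handling. NOT the Crunch API. */
class ToyPipeline {
    private final Path tempDir;

    ToyPipeline() {
        this.tempDir = newDir("toy-pipeline-tmp");
    }

    /** Hypothetical helper: creates a fresh directory under the system temp location. */
    static Path newDir(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Intermediate output: lives under the pipeline temp directory. */
    Path writeIntermediate(String name, String data) {
        return writeFile(tempDir.resolve(name), data);
    }

    /** Checkpoint-style output: written to a caller-chosen path outside the temp directory. */
    Path writeCheckpoint(Path target, String data) {
        return writeFile(target, data);
    }

    /** Deletes the whole temp directory (assumed cleanup behaviour, per the thread). */
    void cleanup() {
        if (!Files.exists(tempDir)) {
            return; // nothing left to delete
        }
        try (Stream<Path> paths = Files.walk(tempDir)) {
            // Reverse order so children are deleted before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static Path writeFile(Path path, String data) {
        try {
            return Files.writeString(path, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline();
        Path intermediate = pipeline.writeIntermediate("stage-1", "temporary data");
        Path checkpoint = pipeline.writeCheckpoint(
                newDir("toy-checkpoint").resolve("live-table"), "data needed later");

        pipeline.cleanup();

        System.out.println("intermediate exists: " + Files.exists(intermediate));
        System.out.println("checkpoint exists:   " + Files.exists(checkpoint));
    }
}
```

In this model the only thing that decides whether an output survives cleanup() is where it was written, which matches Jeff's reading of the source. It says nothing about how the real planner tracks Targets or decides what to recompute.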
