There are some details on scaleFactor here [1]. Essentially, Crunch uses it for a couple of purposes:
1. Calculating the number of reducers to use when grouping, if nothing is specified.
2. Optimizing to reduce how much I/O it has to do, where possible.

In the second case, if your pipeline requires intermediate state to be persisted, Crunch will try to do the least amount of I/O, even at the cost of doing some recalculation. So it might persist to disk right before your DoFn if the DoFn creates significantly more data than it reads. If you know the data increase is that significant, it is definitely advisable to override the method [2] and return a more realistic factor.

>> Can I tell it to leverage more mappers/reducers in the DoFn?

Scale factors apply to the number of reducers, but shouldn't affect the number of mappers, since that is controlled by the input splits.

[1] - http://crunch.apache.org/user-guide.html#doplan
[2] - https://github.com/apache/crunch/blob/188360048a7f2d3cedf5fc915b48d7671f1d8d46/crunch-core/src/main/java/org/apache/crunch/DoFn.java#L132

On Tue, Nov 3, 2015 at 10:38 AM, Robinson, Landon - Landon <[email protected]> wrote:

> All,
>
> I’m trying to understand how I might use scaleFactor() in my Crunch code.
> My use case is this: I have data that I read into a PCollection that is
> smaller than my system’s block size, but when processed in a DoFn, *grows*
> pretty exponentially.
>
> So what started as a 10mb file might become 10 times larger.
>
> To prevent spills and memory issues, how could I leverage something like
> scaleFactor() (or whatever is needed) to indicate to the Crunch Planner
> that my resulting PCollection will grow exponentially?
> Can I tell it to leverage more mappers/reducers in the DoFn?
>
> Guidance, if you could!
>
> Thanks,
> Landon
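To make the scaleFactor() override discussed in the reply concrete, here is a minimal sketch. So that it compiles without Crunch on the classpath, it uses small stand-ins for org.apache.crunch.DoFn and org.apache.crunch.Emitter that mirror only the two methods involved; in a real pipeline you would extend the real DoFn instead. ExpandFn, COPIES and estimatedOutputBytes are hypothetical names, the 0.99f default is written from memory of the linked DoFn source, and the size estimate (input size times scale factor) is an assumption about how the planner consumes the value, not a statement of its exact logic.

```java
import java.util.ArrayList;
import java.util.List;

public class ScaleFactorSketch {

    // Stand-in for org.apache.crunch.Emitter (only the emit method).
    interface Emitter<T> {
        void emit(T value);
    }

    // Stand-in for org.apache.crunch.DoFn (only process and scaleFactor).
    static abstract class DoFn<S, T> {
        public abstract void process(S input, Emitter<T> emitter);

        // Assumed default: output is slightly smaller than input.
        public float scaleFactor() {
            return 0.99f;
        }
    }

    // Hypothetical DoFn that emits COPIES records for each input record.
    static class ExpandFn extends DoFn<String, String> {
        static final int COPIES = 10;

        @Override
        public void process(String input, Emitter<String> emitter) {
            for (int i = 0; i < COPIES; i++) {
                emitter.emit(input + "#" + i);
            }
        }

        // Tell the planner to expect roughly 10x more output than input.
        @Override
        public float scaleFactor() {
            return COPIES;
        }
    }

    // Assumption: estimated output size = input size * scale factor.
    static long estimatedOutputBytes(long inputBytes, DoFn<?, ?> fn) {
        return (long) (inputBytes * (double) fn.scaleFactor());
    }

    public static void main(String[] args) {
        ExpandFn fn = new ExpandFn();
        List<String> out = new ArrayList<>();
        fn.process("record", out::add);

        long inputBytes = 10L * 1024 * 1024; // the 10 MB input from the question
        System.out.println("records emitted per input: " + out.size());
        System.out.println("estimated output bytes: " + estimatedOutputBytes(inputBytes, fn));
    }
}
```

With the real Crunch classes, only the ExpandFn part is needed. As noted in the reply, the override should influence reducer counts and where Crunch chooses to persist data, not the number of mappers.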
