I think it really depends on your script and your environment. A good approach may be to split the script into logical code blocks (jobs), then execute those jobs in series via a bash script. I have also found it helpful to persist the output of each job (not the intermediate data within it) to a persistent data store; if something goes wrong, you don't have to rerun prior computations, only restart from the last failed job (at the cost of additional loads). This modular approach has been helpful in development: you still get the Pig optimization benefits within each module, and it leaves room for future expansion, such as concurrent job execution on your cluster and optimizing cluster capacity.
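To make the idea concrete, here is a rough sketch of such a wrapper, not a definitive implementation. It assumes each job is its own .pig file that STOREs its output where the next job LOADs it. The names run_pipeline, PIG_CMD and MARKER_DIR are made up for illustration, and the marker-file bookkeeping is just one simple way to remember which jobs already finished.

```shell
#!/usr/bin/env bash
# Sketch: run Pig jobs in series and resume from the last failed job.
# run_pipeline, PIG_CMD and MARKER_DIR are hypothetical names for this example.

PIG_CMD="${PIG_CMD:-pig}"                 # command used to run one script
MARKER_DIR="${MARKER_DIR:-./.pig_done}"   # completion markers live here

run_pipeline() {
  mkdir -p "$MARKER_DIR" || return 1
  local script marker
  for script in "$@"; do
    marker="$MARKER_DIR/$(basename "$script").done"
    # A marker means this job already succeeded on an earlier run.
    if [ -e "$marker" ]; then
      echo "skip $script"
      continue
    fi
    echo "run $script"
    # Stop at the first failure so later jobs never load missing data.
    if ! $PIG_CMD "$script"; then
      echo "failed $script" >&2
      return 1
    fi
    touch "$marker"
  done
}
```

You would call it with the jobs in order, e.g. run_pipeline 01_load.pig 02_join.pig 03_report.pig (hypothetical file names). After a failure, fix the problem and run the same command again; the jobs that already finished are skipped. Delete the marker directory to force a full rerun.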
Hope this helps,
-Dan

On Wed, Mar 5, 2014 at 10:33 AM, Christopher Petrino <[email protected]> wrote:
> Hi all, what is everyone's approach for managing a Pig scripts that has
> become very long? What is your best way to break it up into smaller pieces?
