Are these spills happening on the map side or the reduce side? How much memory is allocated to each TaskTracker?

On Wed, Mar 6, 2013 at 6:28 AM, Panshul Whisper <[email protected]> wrote:

> Hello,
>
> I have a file of size 9GB and having approximately 109.5 million records.
> I execute a pig script on this file that is doing:
> 1. Group by on a field of the file
> 2. Count number of records in every group
> 3. Store the result in a CSV file using normal PigStorage(",")
>
> The job is completed successfully but the job details show a lot of memory
> spills. *Out of 109.5 million records, it shows approximately 48 million
> records spilled.*
>
> I am executing it on a *4 node cluster each with a dual core processor
> and 4GB ram*.
>
> How can I minimize the amount of record spills. It really makes the
> execution really slow when the spilling starts.
>
> Any suggestions are welcome.
>
> Thanking You,
>
> --
> Regards,
> Ouch Whisper
> 010101010101
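
For reference, the three steps in the quoted message could be written roughly as the Pig Latin sketch below. The actual script was not posted, so the input path, output path, relation names, and field names here are all hypothetical placeholders, and the real schema will differ.

```pig
-- Sketch only: paths and the two-field schema are assumptions, not from the original post.
records = LOAD 'input/records.csv' USING PigStorage(',')
          AS (key:chararray, value:chararray);

-- 1. Group by one field of the file
grouped = GROUP records BY key;

-- 2. Count the number of records in every group
counts = FOREACH grouped GENERATE group AS key, COUNT(records) AS n;

-- 3. Store the result as comma-separated text
STORE counts INTO 'output/counts' USING PigStorage(',');
```

Knowing whether the real script matches this shape, or does additional work inside the FOREACH, would help narrow down where the spills come from.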
