I thought Todd Lipcon's Hadoop Summit presentation [1] had some good info on this topic.
[1] http://www.slideshare.net/cloudera/mr-perf

Norbert

On Thu, Mar 7, 2013 at 7:25 PM, Prashant Kommireddi <[email protected]> wrote:

> You can do a few things here:
>
>    1. Increase mapred.child.java.opts to a higher number (default is
>    200MB). You will have to do this while making sure (# of MR slots/node X
>    mapred.child.java.opts + 387MB < 4GB). Maybe you want to stay under 3.5GB
>    based on other stuff running on those nodes.
>    2. Increase "mapred.job.shuffle.input.buffer.percent" to make more heap
>    available for the shuffle.
>    3. Set mapred.inmem.merge.threshold to 0
>    and mapred.job.reduce.input.buffer.percent to 0.8.
>
> You will have to play around with these to see what works for your needs.
>
> You can additionally refer to "Hadoop: The Definitive Guide" for tips on
> config tuning.
>
> On Thu, Mar 7, 2013 at 1:01 PM, Panshul Whisper <[email protected]> wrote:
>
> > Hello Prashant,
> >
> > I have a CDH installation and by default the memory allocated to each
> > TaskTracker is 387 MB.
> > And yes, these spills are happening on both the map and reduce side.
> >
> > I still have not solved this problem...
> >
> > Suggestions are welcome.
> >
> > Thanking You,
> >
> > Regards,
> >
> > On Thu, Mar 7, 2013 at 9:05 AM, Prashant Kommireddi <[email protected]> wrote:
> >
> > > Are these spills happening on map or reduce side? What is the memory
> > > allocated to each TaskTracker?
> > >
> > > On Wed, Mar 6, 2013 at 6:28 AM, Panshul Whisper <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I have a 9GB file with approximately 109.5 million records.
> > > > I execute a Pig script on this file that does the following:
> > > > 1. Group by on a field of the file
> > > > 2. Count number of records in every group
> > > > 3. Store the result in a CSV file using normal PigStorage(",")
> > > >
> > > > The job completes successfully, but the job details show a lot of
> > > > memory spills.
> > > > *Out of 109.5 million records, it shows approximately 48 million
> > > > records spilled.*
> > > >
> > > > I am executing it on a *4 node cluster, each node with a dual core
> > > > processor and 4GB RAM*.
> > > >
> > > > How can I minimize the number of spilled records? The execution
> > > > becomes really slow once the spilling starts.
> > > >
> > > > Any suggestions are welcome.
> > > >
> > > > Thanking You,
> > > >
> > > > --
> > > > Regards,
> > > > Ouch Whisper
> > > > 010101010101
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
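
For anyone who wants to sanity-check the memory constraint quoted above (slots per node X child heap + 387 MB TaskTracker, kept under 4 GB or more conservatively 3.5 GB), here is a rough Python sketch of the arithmetic. The function names and the slot count of 4 per node are my own assumptions for illustration and are not from the thread; plug in whatever your TaskTrackers are actually configured with.

```python
# Rough sketch of the per-node memory budget discussed in the thread.
# Function names and the example slot count are hypothetical.

TASKTRACKER_MB = 387          # TaskTracker memory reported in the thread
SAFE_LIMIT_MB = 3584          # ~3.5 GB, leaving headroom on a 4 GB node


def node_memory_mb(slots_per_node, child_heap_mb, tasktracker_mb=TASKTRACKER_MB):
    """Worst-case memory if every MR slot on the node runs a task at full heap."""
    return slots_per_node * child_heap_mb + tasktracker_mb


def max_child_heap_mb(slots_per_node, limit_mb=SAFE_LIMIT_MB,
                      tasktracker_mb=TASKTRACKER_MB):
    """Largest per-task heap (in MB) that keeps the node under limit_mb."""
    return (limit_mb - tasktracker_mb) // slots_per_node


if __name__ == "__main__":
    slots = 4  # assumption: 2 map + 2 reduce slots per node
    print("default 200MB heap uses:", node_memory_mb(slots, 200), "MB")
    print("max heap per task under 3.5GB:", max_child_heap_mb(slots), "MB")
```

With the assumed 4 slots, the default 200MB heap leaves a lot of the node unused, so there should be room to raise mapred.child.java.opts before hitting the limit.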
