We are gathering some aggregate statistics in an NLP application, implemented in Pig (0.7.0 on Hadoop 0.20.2 with Java 1.6.0_22). Some of the MapReduce jobs use a lot of memory (12G of RAM in one process, for a data set that is 50G in text format at the start of the computation). I would like to gain some insight into which portion of the Pig script a given MapReduce task is executing. I dumped the plan using EXPLAIN, but I am having trouble interpreting the output, and I can't find any resources online that help me understand these plans. Does anyone have any pointers, or better ideas on how to debug the logic? I bet I am doing something that is easy to optimize away, but since I expect to use Pig more going forward, I'd like to have a bag of tricks for diagnosing its behavior.
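For reference, here is the kind of thing I am doing, reduced to a minimal sketch (the aliases and input path here are placeholders, not my real script). EXPLAIN on the final alias prints the logical, physical, and MapReduce plans; the MapReduce plan section is the one that shows how operators are grouped into jobs:

```pig
-- hypothetical word-count-like fragment; 'input' and the aliases are placeholders
raw     = LOAD 'input' AS (doc:chararray);
tokens  = FOREACH raw GENERATE FLATTEN(TOKENIZE(doc)) AS word;
grouped = GROUP tokens BY word;
counts  = FOREACH grouped GENERATE group, COUNT(tokens) AS n;
EXPLAIN counts;  -- dumps logical, physical, and map-reduce plans for 'counts'
```

It is the MapReduce-plan portion of that output that I am struggling to map back to lines of my actual script.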
Many thanks,
Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310 437 7300
