I have a Pig script that I've translated from an old Python job. The old script worked by reading a bunch of lines of JSON into sqlite and running queries against that. The sqlite DB ended up being about 1GB on disk by the end of the job (it's about a year's worth of data), and the whole job ran in 40 to 60 minutes single-threaded on a single machine.
The Pig version (sadly I'm stuck on Pig 0.6, as I'm running on EMR), much to my surprise, is *way* slower. Run against a few days' worth of data it takes about 15 seconds; against a month's worth, about 1.1 minutes; against two months, about 3 minutes. But against 3 months of data, in local mode on my 4-proc laptop, it takes hours. So long, in fact, that it didn't finish in an entire work-day and I had to kill it. On Amazon's EMR I did get a job to finish, after 5h44m on 10 m1.small nodes, which is pretty nuts compared to the single-proc Python version.

There are about 15 thousand JSON files totalling 2.1GB uncompressed, so it's not that big. And the code is, I think, pretty simple: http://pastebin.com/3y7e2ZTq

The loader mentioned there is pretty simple too. It's basically a hack of ElephantBird's JSON loader that dives deeper into the JSON and makes bags out of JSON lists, in addition to the simpler maps that EB produces (there's a simplified sketch of the conversion at the bottom of this post): http://pastebin.com/dFKX3AJc

While it's running in local mode on my laptop, it outputs a lot of messages like this (about one per minute):

    2011-11-29 18:34:21,518 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700522216(1660666K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
    2011-11-29 18:34:30,773 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700519216(1660663K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
    2011-11-29 18:34:40,953 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700518024(1660662K) committed = 2060255232(2011968K) max = 2060255232(2011968K)

But I'm not sure how to read EMR's debugging logs well enough to tell whether it's doing the same thing in mapreduce mode on EMR.

So my questions are:

1. Is that Pig script doing anything that's really that performance-intensive? Is the loader doing anything obviously bad? Why on earth is this so slow? The whole dataset should fit in RAM on 2 nodes, let alone 10.
2. How do people generally go about profiling these scripts?
3. Is that "Low Memory Detector" message in local mode anything to be worried about? Or is it just telling me that some intermediate dataset doesn't fit in RAM and is being spilled to disk?
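In case the pastebin link dies, here's roughly what the loader's conversion step does, stripped of the LoadFunc plumbing. This is a simplified sketch with made-up names, not the exact pastebin code:

    // Simplified sketch of the loader's JSON -> Pig conversion (illustrative
    // names, not the actual pastebin code). JSON objects become Pig maps,
    // JSON arrays become bags of single-field tuples, scalars become strings.
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.TupleFactory;
    import org.json.simple.JSONArray;
    import org.json.simple.JSONObject;

    public class JsonToPig {
        private static final TupleFactory tupleFactory = TupleFactory.getInstance();
        private static final BagFactory bagFactory = BagFactory.getInstance();

        // Recursively convert a value parsed by json-simple into a Pig value.
        public static Object convert(Object json) {
            if (json instanceof JSONObject) {
                // JSON object -> Pig map (json-simple's JSONObject is a raw Map)
                Map<String, Object> map = new HashMap<String, Object>();
                for (Object o : ((JSONObject) json).entrySet()) {
                    Map.Entry<?, ?> e = (Map.Entry<?, ?>) o;
                    map.put(e.getKey().toString(), convert(e.getValue()));
                }
                return map;
            } else if (json instanceof JSONArray) {
                // JSON array -> Pig bag of one-field tuples; this is the part
                // that stock ElephantBird doesn't do.
                DataBag bag = bagFactory.newDefaultBag();
                for (Object item : (JSONArray) json) {
                    bag.add(tupleFactory.newTuple(convert(item)));
                }
                return bag;
            } else {
                // Scalars (String/Long/Double/Boolean from json-simple) are
                // kept as strings, since Pig 0.6 has no boolean type.
                return json == null ? null : json.toString();
            }
        }
    }

In the real loader this conversion runs inside getNext() for every input line, so every JSON list in the data becomes a bag.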
