I have a Pig script that I've translated from an old Python job. The old script 
worked by reading a bunch of lines of JSON into sqlite and running queries 
against that. The sqlite DB ended up being about 1gb on disk by the end of the 
job (it's about a year's worth of data) and the whole job ran in 40 to 60 
minutes single-threaded on a single machine.

The Pig version (sadly I require Pig 0.6, as I'm running on EMR), much to my 
surprise, is *way* slower. Run against a few days' worth of data it takes about 
15 seconds, against a month's worth about 1.1 minutes, and against two months 
about 3 minutes, but against 3 months of data in local mode on my 4-proc 
laptop it takes hours. So long, in fact, that it didn't finish in an entire 
workday and I had to kill it. On Amazon's EMR I did get a job to finish after 
5h44m using 10 m1.small nodes, which is pretty nuts compared to the single-proc 
Python version.

There are about 15 thousand JSON files totalling 2.1gb (uncompressed), so it's 
not that big. And the code is, I think, pretty simple. Take a look: 
http://pastebin.com/3y7e2ZTq . The loader mentioned there is pretty simple 
too: it's basically a hack of ElephantBird's JSON loader that dives deeper 
into the JSON and makes bags out of JSON lists, in addition to the simple maps 
that EB produces: http://pastebin.com/dFKX3AJc
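
In case the pastebin ever dies, the core of what the loader does is roughly 
this (a minimal sketch, assuming the json-simple parser that ElephantBird's 
JsonLoader uses; the class and method names here are illustrative, not the 
actual pastebin code):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.TupleFactory;
    import org.json.simple.JSONArray;
    import org.json.simple.JSONObject;

    public class JsonToPig {
        private static final BagFactory BAGS = BagFactory.getInstance();
        private static final TupleFactory TUPLES = TupleFactory.getInstance();

        /** Recursively convert a parsed JSON value into the closest Pig type. */
        @SuppressWarnings("unchecked")
        public static Object toPig(Object json) {
            if (json instanceof JSONObject) {
                // JSON objects become Pig maps, just like EB's loader does.
                Map<String, Object> map = new HashMap<String, Object>();
                for (Map.Entry<Object, Object> e :
                        ((Map<Object, Object>) json).entrySet()) {
                    map.put(e.getKey().toString(), toPig(e.getValue()));
                }
                return map;
            } else if (json instanceof JSONArray) {
                // The extra "dive deeper" step: JSON lists become bags of
                // single-field tuples instead of getting flattened to strings.
                DataBag bag = BAGS.newDefaultBag();
                for (Object element : (JSONArray) json) {
                    bag.add(TUPLES.newTuple(toPig(element)));
                }
                return bag;
            }
            // Scalars (strings, numbers, booleans) are stored as chararrays.
            return json == null ? null : json.toString();
        }
    }

The only design choice worth flagging is that every list element gets wrapped 
in its own single-field tuple, since Pig bags can only hold tuples.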

While it's running in local mode on my laptop it outputs a lot of messages 
like this (about one per minute):

2011-11-29 18:34:21,518 [Low Memory Detector] INFO  
org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called 
(Collection threshold exceeded) init = 65404928(63872K) used = 
1700522216(1660666K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
2011-11-29 18:34:30,773 [Low Memory Detector] INFO  
org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called 
(Collection threshold exceeded) init = 65404928(63872K) used = 
1700519216(1660663K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
2011-11-29 18:34:40,953 [Low Memory Detector] INFO  
org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called 
(Collection threshold exceeded) init = 65404928(63872K) used = 
1700518024(1660662K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
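
As far as I can tell those numbers are just a java.lang.management.MemoryUsage 
dump, and my (possibly wrong) understanding is that Pig's 
SpillableMemoryManager gets them by registering a JMX listener on the heap's 
collection-usage threshold, along these lines (a standalone sketch, not Pig's 
actual code; the half-of-max threshold is made up):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryNotificationInfo;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;

    import javax.management.Notification;
    import javax.management.NotificationEmitter;
    import javax.management.NotificationListener;

    public class CollectionThresholdDemo {
        public static void main(String[] args) {
            // Ask each heap pool to notify us when, *after* a full GC, its
            // usage is still above half of max -- i.e. live data, not garbage.
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getType() == MemoryType.HEAP
                        && pool.isCollectionUsageThresholdSupported()) {
                    long max = pool.getUsage().getMax();
                    if (max > 0) {
                        pool.setCollectionUsageThreshold(max / 2);
                    }
                }
            }
            // The platform MemoryMXBean emits the threshold notifications.
            NotificationEmitter emitter =
                    (NotificationEmitter) ManagementFactory.getMemoryMXBean();
            emitter.addNotificationListener(new NotificationListener() {
                public void handleNotification(Notification n, Object handback) {
                    if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED
                            .equals(n.getType())) {
                        // This is the "(Collection threshold exceeded)" case in
                        // the log: the heap stays nearly full even after GC.
                        System.out.println("low memory: "
                                + ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
                    }
                }
            }, null, null);
            // ... allocate and retain a few GB here to trigger it.
        }
    }

If that's right, then "used" staying pinned around 1.6gb of a ~1.9gb max heap 
means something really is holding nearly a heap's worth of live data.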

But I'm not sure how to read EMR's debugging logs to tell whether it's doing 
the same thing in mapreduce mode on EMR too. So my questions are:

1. Is that Pig script doing anything that's really that performance-intensive? 
Is the loader doing anything obviously bad? Why on earth is this so slow? This 
whole dataset should fit in RAM on 2 nodes, let alone 10.

2. How do people generally go about profiling these scripts?

3. Is that "Low Memory Detector" error in local mode anything to be worried 
about? Or is it just telling me that some intermediate dataset doesn't fit in 
RAM and is being spilled to disc?
