David, Hadoop was engineered to efficiently process a small number of large files, not the other way around. Since Pig runs on top of Hadoop, it inherits the same limitation. Some improvements have been made on that front (CombineFileInputFormat), but performance is still lacking.
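For reference, later Pig releases (0.8 and up) expose that split combining through a couple of job properties. The property names below come from the newer Pig documentation and won't help on 0.6, so treat this as a sketch of what upgrading would buy you rather than something to run today:

    # combine many small input files into splits of up to ~128 MB (Pig 0.8+ only)
    pig -Dpig.splitCombination=true -Dpig.maxCombinedSplitSize=134217728 myscript.pig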
The version of Pig you are using is plagued by memory issues, which hurts performance, especially when accumulating large amounts of data in the reducers. It is hard to give you any pointers without seeing the script you are using. Generally, though, you might want to consider the following:

1. Use a newer version of Pig (there might be a workaround that can be put in place on EMR to do that).
2. Increase the memory available to each mapper / reducer.
3. Reduce the number of input files by concatenating small files into one or a few large files, for example combining daily files into monthly or yearly files. (Rough sketches for 2 and 3 are right below this list.)
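These are only sketches: the paths and file names are made up, and the exact way to pass the heap setting varies by Pig/EMR version (a bootstrap action or mapred-site.xml is the usual route on EMR):

    # 2. give each map/reduce task a bigger heap via the standard Hadoop property
    pig -Dmapred.child.java.opts=-Xmx1024m myscript.pig

    # 3. concatenate a month of small JSON files into a single file before loading
    hadoop fs -cat input/2011-01/*.json | hadoop fs -put - input-combined/2011-01.json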
Lastly, do not treat execution speed on your laptop as a benchmark. Hadoop gets its power from running in distributed mode on multiple nodes. Local mode will generally perform much worse than a single-threaded process, since it tries to mimic what happens on the cluster, which requires quite a bit of coordination between mappers and reducers.

Alex

On Wed, Nov 30, 2011 at 1:05 AM, David King <[email protected]> wrote:
> I have a pig script that I've translated from an old Python job. The old
> script worked by reading a bunch of lines of JSON into sqlite and running
> queries against that. The sqlite DB ended up being about 1gb on disk by
> the end of the job (it's about a year's worth of data), and the whole job
> ran in 40 to 60 minutes single-threaded on a single machine.
>
> The pig version (sadly I require pig 0.6, as I'm running on EMR), much to
> my surprise, is *way* slower. Run against a few days' worth of data it
> takes about 15 seconds, against a month's worth about 1.1 minutes, and
> against two months about 3 minutes, but against 3 months of data in local
> mode on my 4-proc laptop it takes hours. So long, in fact, that it didn't
> finish in an entire work-day and I had to kill it. On Amazon's EMR I did
> get a job to finish after 5h44m using 10 m1.small nodes, which is pretty
> nuts compared to the single-proc Python version.
>
> There are about 15 thousand JSON files totalling 2.1gb (uncompressed), so
> it's not that big. And the code is, I think, pretty simple. Take a look:
> http://pastebin.com/3y7e2ZTq . The loader mentioned there is pretty simple
> too; it's basically a hack of ElephantBird's JSON loader to dive deeper
> into the JSON and make bags out of JSON lists, in addition to the simpler
> maps that EB does: http://pastebin.com/dFKX3AJc
>
> While it's running in local mode on my laptop it outputs a lot (about one
> per minute) of messages like this:
>
> 2011-11-29 18:34:21,518 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700522216(1660666K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
> 2011-11-29 18:34:30,773 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700519216(1660663K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
> 2011-11-29 18:34:40,953 [Low Memory Detector] INFO org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700518024(1660662K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
>
> But I'm not sure how to read EMR's debugging logs to know if it's doing
> that in mapreduce mode on EMR too. So my questions are:
>
> 1. Is that pig script doing anything that's really that performance
> intensive? Is the loader doing anything obviously bad? Why on earth is
> this so slow? This whole dataset should fit in RAM on 2 nodes, let alone
> 10.
>
> 2. How do people generally go about profiling these scripts?
>
> 3. Is that "Low Memory Detector" error in local mode anything to be
> worried about? Or is it just telling me that some intermediate dataset
> doesn't fit in RAM and is being spilled to disk?
