Good to know. I'm trying a single-node Hadoop cluster now. The main input is about 1+ million lines of events. After some aggregation, it joins with another input source which also has about 1+ million rows. Is this considered a small query? Thanks.
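(For reference, a minimal Pig Latin sketch of the aggregate-then-join shape described above. The paths, field names, and schema are hypothetical illustrations, not taken from the actual job:)

    -- Hypothetical sketch: hourly aggregation of ~1M events, then a join
    -- against a second ~1M-row input. Paths and fields are made up.
    events  = LOAD 'hourly_events' AS (user_id:chararray, hour:int, value:double);
    grouped = GROUP events BY (user_id, hour);
    agg     = FOREACH grouped GENERATE
                  FLATTEN(group) AS (user_id, hour),
                  COUNT(events) AS n_events,
                  AVG(events.value) AS avg_value;
    dims    = LOAD 'dimension_table' AS (user_id:chararray, segment:chararray);
    joined  = JOIN agg BY user_id, dims BY user_id;
    STORE joined INTO 'joined_output';

At roughly a million rows per side, a join like this is still small by Hadoop standards, so per-job MapReduce startup overhead can easily dominate the runtime.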
On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai <[email protected]> wrote:

> Local mode and mapreduce mode make a huge difference. For a small query,
> the mapreduce overhead will dominate. For a fair comparison, can you set up
> a single node hadoop cluster on your laptop and run Pig on it?
>
> Daniel
>
> On 06/14/2011 10:54 AM, Dexin Wang wrote:
>
> Thanks for your feedback. My comments below.
>
> On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai <[email protected]> wrote:
>
>> Curious, couple of questions:
>> 1. Are you running in local mode or mapreduce mode?
>
> Local mode (-x local) when I ran it on my laptop, and mapreduce mode when
> I ran it on the EC2 cluster.
>
>> 2. If mapreduce mode, did you look into the hadoop log to see how much
>> slowdown each mapreduce job adds?
>
> I'm looking into that.
>
>> 3. What kind of query is it?
>
> The input is gzipped JSON files with one event per line. I do some hourly
> aggregation on the raw events, then a bunch of grouping, joining, and
> metrics computation (like median and variance) on some fields.
>
>> Daniel
>
> Someone mentioned it's EC2's I/O performance. But I'm sure there are
> plenty of people using EC2/EMR to run big MR jobs, so more likely I have
> some configuration issue? My jobs can be optimized a bit, but the fact
> that running on my laptop is faster tells me this is a separate issue.
>
> Thanks!
>
>> On 06/13/2011 11:54 AM, Dexin Wang wrote:
>>
>>> Hi,
>>>
>>> This is probably not directly a Pig question.
>>>
>>> Anyone running Pig on Amazon EC2 instances? Something's not making
>>> sense to me. I ran a Pig script that has about 10 mapred jobs in it on
>>> a 16-node cluster using m1.small. It took *13 minutes*. The job reads
>>> input from S3 and writes output to S3. But from the logs, the reading
>>> and writing to/from S3 is pretty fast. And all the intermediate steps
>>> should happen on HDFS.
>>>
>>> Running the same job on my MacBook Pro laptop, it only took *3 minutes*.
>>>
>>> Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop. I'll try
>>> Pig 0.6 on my laptop. Some Hadoop config is probably also not ideal. I
>>> tried m1.large instead of m1.small; it doesn't seem to make a huge
>>> difference. Anything you would suggest looking at for the slowness on
>>> EC2?
>>>
>>> Dexin
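(For the comparison Daniel suggests, the same script can be run in either execution mode; a minimal sketch, assuming a hypothetical script name myscript.pig:)

    # Local mode: single JVM against the local filesystem, no MapReduce overhead.
    pig -x local myscript.pig
    # Mapreduce mode: submits jobs to the configured Hadoop cluster.
    pig -x mapreduce myscript.pig

If the gap persists when running against a single-node cluster on the same laptop, the difference is per-job MapReduce overhead rather than anything specific to EC2.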
