On Jun 25, 2011, at 9:15am, Ted Dunning wrote:

> I have had best results with somewhat beefier machines because you pay
> less VM overhead.

Definitely matches my experience. For example, with m1.small instances the
I/O performance is dreadful. So we typically run with m1.large instances,
and spot pricing (for all slaves, not the master, as Sean notes - a launch
sketch follows at the end of this message).

> Typical Hadoop configuration advice lately is 4GB per core and 1 disk
> spindle per two cores. For higher performance systems like MapR, the
> number of spindles can go up.

The 4GB per core is a bit higher than what I typically use. E.g. most
configs I've seen oversubscribe cores by 50% (12 cores => 18 total map +
reduce slots), and 2GB per task is plenty - often you can get away with much
less (see the config sketch below). Though I know of one config where a
well-known Hadoop consulting company set up a 24-core box (12 physical
cores) with 40 mappers and 4 reducers. It depends on the target use case.

-- Ken

> On Sat, Jun 25, 2011 at 2:21 AM, Sean Owen <[email protected]> wrote:
>
>> I think EMR is well worth using. I just think you do want to throw more,
>> and smaller, machines at the task than you imagine. I used the 'small'
>> instance, but you might get away with a fleet of micro instances even.
>> And do most certainly request spot instances for your workers (but pay
>> full rate for your master to ensure it's not killed). It stays reasonably
>> economical this way, even if I wouldn't call it "dirt cheap".
>>
>> On Sat, Jun 25, 2011 at 9:06 AM, Chris Schilling <[email protected]>
>> wrote:
>>
>>> Hey Sean,
>>>
>>> Just curious about your AWS comment. I am only in the very early testing
>>> phases with AWS EMR. So, would you say that you generally recommend
>>> manually setting up an EC2 cluster to run Mahout, rather than using EMR?
>>> I guess the question is: for those of us without the resources to set up
>>> an in-house Hadoop cluster, what is the best setup we can hope to
>>> achieve?

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
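
For concreteness, here is a minimal sketch of the kind of slot layout Ken
describes, using Hadoop 0.20-era property names in mapred-site.xml. The
12 map / 6 reduce split is just one illustrative way to reach 18 slots on a
12-core box, and both the split and the 2GB child heap are starting points
to tune for your workload, not recommendations:

    <!-- mapred-site.xml: one possible slot layout for a 12-core box,
         oversubscribing cores by 50% (12 map + 6 reduce = 18 slots).
         Property names are from the Hadoop 0.20.x line. -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>12</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>6</value>
      </property>
      <property>
        <!-- ~2GB heap per task; often much less is enough -->
        <name>mapred.child.java.opts</name>
        <value>-Xmx2048m</value>
      </property>
    </configuration>

One sanity check when picking these numbers: total slots times the child
JVM heap should stay comfortably under the machine's physical RAM, or the
box will start swapping under a full load.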

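And a sketch of the launch setup Sean describes: spot instances for the
workers, with an on-demand master so the job flow can't lose its JobTracker
to a spot reclaim. This assumes a boto version with EMR spot-instance
support; the instance counts, bid price, and S3 bucket below are
illustrative placeholders, and you would add your own Mahout job steps:

    # Sketch: EMR job flow with an on-demand master and spot-priced
    # core nodes, per Sean's advice. Counts, bid price, and S3 paths
    # are placeholders.
    from boto.emr.connection import EmrConnection
    from boto.emr.instance_group import InstanceGroup

    conn = EmrConnection()  # reads AWS credentials from the environment

    groups = [
        # Master at the full on-demand rate, so it can't be reclaimed.
        InstanceGroup(1, 'MASTER', 'm1.large', 'ON_DEMAND', 'master'),
        # Slaves as spot instances; they may be reclaimed if the market
        # price exceeds the bid, in which case Hadoop re-runs lost tasks.
        InstanceGroup(4, 'CORE', 'm1.large', 'SPOT', 'core',
                      bidprice='0.12'),
    ]

    jobflow_id = conn.run_jobflow(
        name='mahout-job',
        log_uri='s3://my-bucket/emr-logs',  # placeholder bucket
        instance_groups=groups,
        keep_alive=False,
        steps=[],  # add the JarStep(s) that run your Mahout job here
    )
    print(jobflow_id)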