Sean Owen <srowen <at> gmail.com> writes:

> You haven't even said what algorithm. It even depends on the distribution
> of your data, in addition to amount, not to mention the type of servers,
> configuration, etc. It's impossible to give a meaningful baseline. You can
> run your real data on a real cluster to get some notion. Run-time and
> requirements generally scale up linearly.
>
> On Wed, May 30, 2012 at 10:32 AM, jcuencaa
> <jordi.cuenca.aubets <at> everis.com> wrote:
>
> > Hello!
> > I need to do a capacity planning or a server sizing for a Mahout + Hadoop
> > server, it means, plan how many servers and hardware (CPU, memory, etc.) do
> > I need to accomplish with the maximum amount of work that my organization
> > requires in a given period.
> > I haven’t found documentation regarding to this in the Mahout or Hadoop
> > site or, at least, which things should be taken into account for doing the
> > server sizing. It’s obvious that sizing depends on many factors but, in
> > example, in Application servers or Web Servers normally sizing is done
> > inferring hardware needs using some benchmarks as a baseline.
> > So I’d be pleased if someone can help me.
> > Thanks in advance.
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Server-sizing-Hadoop-Mahout-tp3986807.html
> > Sent from the Mahout User List mailing list archive at Nabble.com.
Hi Sean!

First of all, thanks for your reply! I agree that sizing an environment is very complicated, since there are many variables to consider. You have mentioned some of them: the algorithm, the distribution of the data, the amount of data, the type of hardware, etc.

But I don't agree that it's impossible to give a baseline. It might be a good idea for the Mahout + Hadoop community to take a look at what these folks do (Standard Performance Evaluation Corporation, http://www.spec.org/). They run the same benchmark on different types of architectures, empirically establishing a baseline that can be used as a good starting point for capacity planning. They have many benchmarks, covering CPU, Java client/server, etc.

Obviously, that's only a starting point: before your software goes live in production, it's desirable to benchmark it again by running a load test, and to adjust your infrastructure according to the performance results.
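To make the idea concrete, here is a rough sketch of how a measured baseline could be turned into a first sizing estimate. It relies on the linear-scaling rule of thumb from the quoted reply, which may not hold for every algorithm or data distribution. The function name, its parameters, and the 25% headroom figure are all made up for illustration; none of them come from Mahout or Hadoop.

```python
import math


def estimate_nodes(baseline_records, baseline_seconds, baseline_nodes,
                   target_records, deadline_seconds, headroom=0.25):
    """Extrapolate a node count from one benchmark run.

    Assumption (hypothetical, to be validated by a load test): run time
    scales linearly with the amount of data and inversely with the
    number of nodes.
    """
    # Cost of the baseline run, expressed as node-seconds per record.
    node_seconds_per_record = baseline_seconds * baseline_nodes / baseline_records

    # Total work needed for the target data volume under linear scaling.
    required_node_seconds = node_seconds_per_record * target_records

    # Nodes needed to finish within the deadline, before any safety margin.
    raw_nodes = required_node_seconds / deadline_seconds

    # Add headroom for skew, failures, and growth, then round up.
    return math.ceil(raw_nodes * (1 + headroom))


if __name__ == "__main__":
    # Made-up baseline: 1M records took 10 minutes on 4 nodes.
    # Target: 50M records within 1 hour.
    print(estimate_nodes(1_000_000, 600, 4, 50_000_000, 3600))
```

The result is only a starting point: the estimate should be checked against a load test with real data on real hardware before committing to a cluster size.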
