Sean Owen <srowen <at> gmail.com> writes:

> 
> You haven't even said what algorithm. It even depends on the distribution
> of your data, in addition to amount, not to mention the type of servers,
> configuration, etc. It's impossible to give a meaningful baseline. You can
> run your real data on a real cluster to get some notion. Run-time and
> requirements generally scale up linearly.
> 
> On Wed, May 30, 2012 at 10:32 AM, jcuencaa
> <jordi.cuenca.aubets <at> everis.com> wrote:
> 
> > Hello!
> > I need to do a capacity planning or a server sizing for a Mahout + Hadoop
> > server, it means, plan how many servers and hardware (CPU, memory, etc.) do
> > I need to accomplish with the maximum amount of work that my organization
> > requires in a given period.
> > I haven’t found documentation regarding to this in the Mahout or Hadoop
> > site or, at least, which things should be taken into account for doing the
> > server sizing. It’s obvious that sizing depends on many factors but, in
> > example, in Application servers or Web Servers normally sizing is done
> > inferring hardware needs using some benchmarks as a baseline.
> > So I’d be pleased if someone can help me.
> > Thanks in advance.
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Server-sizing-Hadoop-Mahout-tp3986807.html
> > Sent from the Mahout User List mailing list archive at Nabble.com.
> >
> 

Hi Sean! First of all, thanks for your reply!
I do agree that sizing an environment is very complicated, since there are many
variables to consider. You have mentioned some of them: the algorithm, the
distribution of the data, the amount of data, the type of hardware, etc.
But I don't agree that it's impossible to give a baseline.
Maybe it would be a good idea for the Mahout+Hadoop community to take a look at
what these guys do (Standard Performance Evaluation Corporation,
http://www.spec.org/). They run the same benchmark on different types of
architectures, empirically establishing a baseline that can be used as a good
starting point for capacity planning.
They have many benchmarks, covering CPU, Java client/server, etc.
Obviously, that's only a starting point: before your software goes live in
production, it's desirable to benchmark it again by running a load test and
adjusting your infrastructure according to the performance results.
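
To illustrate how such a baseline might be used, here is a rough sketch in
Python. It assumes, as Sean says, that run time scales roughly linearly with
the amount of data, and additionally that throughput scales linearly with the
number of servers, which is only an approximation on a real cluster. The
function name, the headroom factor and all the numbers are hypothetical; they
are not measurements from any Mahout or Hadoop cluster.

```python
import math


def servers_needed(baseline_records, baseline_seconds, baseline_servers,
                   target_records, window_seconds, headroom=0.25):
    """Estimate cluster size by linear extrapolation from one benchmark run.

    Assumption: work scales linearly with both data volume and server count.
    headroom is an extra safety margin (0.25 means 25% more servers).
    """
    inputs = (baseline_records, baseline_seconds, baseline_servers,
              target_records, window_seconds)
    if min(inputs) <= 0 or headroom < 0:
        raise ValueError("all sizes and times must be positive")

    # Server-seconds consumed per record in the baseline run, multiplied by
    # the target volume and divided by the time available for the job.
    raw = (target_records * baseline_seconds * baseline_servers) / (
        baseline_records * window_seconds)

    # Round up: a fraction of a server still means one more machine.
    return math.ceil(raw * (1 + headroom))


# Hypothetical baseline: 10 million records in 1 hour on 4 servers.
# Hypothetical target: 200 million records within an 8-hour window.
print(servers_needed(10_000_000, 3600, 4, 200_000_000, 8 * 3600))  # prints 13
```

An estimate like this is only the starting point; the load test on real data
is what confirms or corrects it.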

