Just to put things in perspective...

16GB isn't a lot of memory for a data node when paired with 8-core Nehalem 
(Intel Xeon E5500-series) CPUs.

I would suggest you take a look at your Ganglia output on your cloud.
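
If Ganglia isn't wired up yet, the usual hookup on an 0.20-era cluster is 
conf/hadoop-metrics.properties. A minimal sketch, assuming a gmond collector 
at a placeholder host "gmond-host" on the default port 8649 (substitute your 
own):

    # hadoop-metrics.properties: push DFS, MapReduce and JVM metrics to Ganglia
    # "gmond-host:8649" is a placeholder for your own gmond collector
    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    dfs.period=10
    dfs.servers=gmond-host:8649
    mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    mapred.period=10
    mapred.servers=gmond-host:8649
    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    jvm.period=10
    jvm.servers=gmond-host:8649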

From our experience, we're disk i/o bound. Our DNs typically have 8 cores, 
32GB, and 4x 2TB 7200 RPM SATA drives.
(Going across 2 racks is also an issue... but that's a different story.)

In terms of general Hadoop (we also run HBase)... 
The more disks (spindles), the better. 
So if you're going to cap your boxes at 16GB, then look at 4 cores and 4x 2TB 
drives in each box.
I'd even consider a motherboard that can support 2 sockets and only populate 1 
socket.
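
Note that Hadoop only spreads i/o across those spindles if you list one 
directory per physical disk. A sketch of the relevant settings; the 
/disk1../disk4 mount points are made up, substitute your own:

    <!-- hdfs-site.xml: one data dir per disk, so the DataNode round-robins
         block writes across all four spindles -->
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data,/disk4/dfs/data</value>
    </property>

    <!-- mapred-site.xml: spread map spill/shuffle files the same way -->
    <property>
      <name>mapred.local.dir</name>
      <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local,/disk4/mapred/local</value>
    </property>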

As others have pointed out in different discussions, memory will limit the 
number of m/r tasks you can run. 

4 cores is 8 virtual CPUs (with hyper-threading), so you could run 8 m/r tasks 
at 512MB per task, plus 1GB for the DataNode and 1GB for the TaskTracker, 
which is about 6GB of the 16GB.
Then you'd have some head room to run 1024MB per task, or more if you end up 
with larger jobs.
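
For reference, the knobs behind that math are in mapred-site.xml. A sketch for 
the 8-slot / 512MB case above; the 6-map/2-reduce split and the heap size are 
assumptions you'd tune to your own jobs:

    <!-- 8 task slots total on the node: 6 map + 2 reduce -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <!-- per-task child JVM heap; bump to -Xmx1024m when you have the head room -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>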


HTH

-Mike

> Date: Mon, 22 Nov 2010 15:50:05 +0200
> Subject: Re: Hadoop/HBase hardware requirement
> From: [email protected]
> To: [email protected]
> 
> I believe our bottleneck is currently with memory.
> Most of the CPU is dedicated to parsing and working with gzip files (we
> don't have heavy computational tasks).
> But on the other hand, more memory and disk mean we can run more M/R jobs
> and scans, which need more CPU....
> 
> Lior
> 
> 
> On Mon, Nov 22, 2010 at 3:22 PM, Lars George <[email protected]> wrote:
> 
> > Hi Lior,
> >
> > Depends on your load - is it IO or CPU bound? Sounds like IO or Disk
> > from the above, right? I would opt for more machines! This will
> > spread the load better across the cluster. And you can always add more
> > disks in v2 of your setup.
> >
> > Lars
> >
> > On Mon, Nov 22, 2010 at 1:56 PM, Lior Schachter <[email protected]>
> > wrote:
> > > And another, more concrete question:
> > > let's say that for every machine with two quad-core CPUs, 4TB and 16GB, I
> > > can buy 2 machines with one quad-core CPU, 2TB and 16GB.
> > >
> > > Which configuration should I choose ?
> > >
> > > Lior
> > >
> > > On Mon, Nov 22, 2010 at 2:27 PM, Lior Schachter <[email protected]> wrote:
> > >
> > >> Hi all, Thanks for your input and assistance.
> > >>
> > >>
> > >> From your answers I understand that:
> > >> 1. More is better, but our configuration might work.
> > >> 2. There are small tweaks we can do that will improve our configuration
> > >> (like having 4x500GB disks).
> > >> 3. Use monitoring (like Ganglia) to find the bottlenecks.
> > >>
> > >> For me, the question here is how to balance between our current budget
> > >> and system stability (and performance).
> > >> I agree that more memory and more disk space will improve our
> > >> responsiveness, but on the other hand our system is NOT expected to be
> > >> real-time (but rather back-office analytics with a few hours' delay).
> > >>
> > >> This is a crucial point, since the proposed configurations we found on
> > >> the web don't distinguish between real-time configurations and
> > >> back-office configurations. Building a real-time cluster with 20 nodes
> > >> will cost around 200-300K (in Israel); this is similar to the price of a
> > >> quite strong Oracle cluster... so my boss (the CTO) was partially right
> > >> when telling me - but you said it would be cheap !! very cheap :)
> > >>
> > >> I believe that more money will come when we show the viability of the
> > >> system... I also read that heterogeneous clusters are common.
> > >>
> > >> It will help a lot if you can provide your configurations and system
> > >> characteristics (maybe in a Wiki page).
> > >> It will also help to get more of the "small tweaks" that you found
> > >> helpful.
> > >>
> > >>
> > >> Lior Schachter
> > >>
> > >> On Mon, Nov 22, 2010 at 1:33 PM, Lars George <[email protected]> wrote:
> > >>
> > >>> Oleg,
> > >>>
> > >>> Do you have Ganglia or some other graphing tool running against the
> > >>> cluster? It gives you metrics that are crucial here, for example the
> > >>> load on Hadoop and its DataNodes as well as insertion rates etc. on
> > >>> HBase. The compaction queue is also interesting, to see if the
> > >>> cluster is falling behind.
> > >>>
> > >>> Did you try loading from an empty system to a loaded one? Or was it
> > >>> already filled and you are trying to add more? Are you spreading the
> > >>> load across servers or are you using sequential keys that tax only one
> > >>> server at a time?
> > >>>
> > >>> 16GB should work, but is not ideal. The various daemons simply need
> > >>> room to breathe. But that said, I have personally started with 12GB
> > >>> even and it worked.
> > >>>
> > >>> Lars
> > >>>
> > >>> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <[email protected]> wrote:
> > >>> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <[email protected]> wrote:
> > >>> >
> > >>> >> Oleg & Lior,
> > >>> >>
> > >>> >> Couple of questions & couple of suggestions to ponder:
> > >>> >> A)  When you say 20 Name Servers, I assume you are talking about
> > >>> >> 20 Task Servers
> > >>> >>
> > >>> >
> > >>> > Yes
> > >>> >
> > >>> >
> > >>> >> B)  What type are your M/R jobs ? Compute-intensive vs.
> > >>> >> storage-intensive ?
> > >>> >>
> > >>> >
> > >>> > Most of the M/R work is parsing; 5% - 10% of the m/r output is
> > >>> > stored in hbase.
> > >>> >
> > >>> >
> > >>> >> C)  What is your Data growth ?
> > >>> >>
> > >>> >
> > >>> >  Currently we have 50GB per day; it could grow to ~150GB.
> > >>> >
> > >>> >
> > >>> >> D)  With the current jobs, are you saturating RAM ? CPU ? Or
> > >>> >> storage ?
> > >>> >>
> > >>> >    The map phase runs at 100% CPU since it is parsing and the
> > >>> >    input files are gzipped.
> > >>> >    We definitely have memory issues.
> > >>> >
> > >>> >
> > >>> >> Ganglia/Hadoop metrics should tell.
> > >>> >> E)  Also are your jobs long running or short tasks ?
> > >>> >>
> > >>> >    Map tasks take from 5 seconds to 2 minutes.
> > >>> >    The reducer (insertion into hbase) takes ~3 hours.
> > >>> >
> > >>> >
> > >>> >> Suggestions:
> > >>> >> A)  Your name node could be 32GB with a 2TB disk. Make sure it is
> > >>> >> an enterprise-class server, and also back up the metadata to an NFS
> > >>> >> mount.
> > >>> >> B)  Also have a decent machine as the checkpoint name node. It
> > >>> >> could be similar to the task nodes.
> > >>> >> C)  I assume by Master Machine you mean the Job Tracker. It could
> > >>> >> be similar to the Task Trackers - 16/24GB memory, with 4-8TB disk.
> > >>> >> D)  As Jean-Daniel pointed out, 500GB disks (with more spindles)
> > >>> >> are what I would also recommend. But it also depends on your
> > >>> >> primary, intermediate and final data sizes. 1 or 2TB disks are also
> > >>> >> fine, because they give you more storage. I assume you have the
> > >>> >> default replication of 3.
> > >>> >> E)  A dedicated 1Gb network would be good. As there are only ~25
> > >>> >> machines, you can hang them off of a good Gb switch. Consider 10Gb
> > >>> >> in the future if there is too much intermediate data traffic.
> > >>> >> Cheers
> > >>> >> <k/>
> > >>> >>
> > >>> >> On Sun, Nov 21, 2010, "Oleg Ruchovets" <[email protected]> wrote:
> > >>> >>
> > >>> >> >Hi all,
> > >>> >> >After testing HBase for a few months with a very light configuration
> > >>> >> >(5 machines, 2TB disk, 8GB RAM each), we are now planning for
> > >>> >> >production.
> > >>> >> >Our Load -
> > >>> >> >1) 50GB of log files to process per day with Map/Reduce jobs.
> > >>> >> >2) Insert 4-5GB into 3 tables in hbase.
> > >>> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
> > >>> >> >All this should run in parallel.
> > >>> >> >Our current configuration can't cope with this load and we are
> > >>> >> >having many stability issues.
> > >>> >> >
> > >>> >> >This is what we have in mind:
> > >>> >> >1. Master machine - 32GB, 4TB, two quad-core CPUs.
> > >>> >> >2. Name node - 16GB, 2TB, two quad-core CPUs.
> > >>> >> >We plan to have up to 20 name servers (starting with 5).
> > >>> >> >
> > >>> >> >We already read
> > >>> >> >http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> > >>> >> >
> > >>> >> >We would appreciate your feedback on our proposed configuration.
> > >>> >> >
> > >>> >> >
> > >>> >> >Regards, Oleg & Lior
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
