Hi Lior,

I can only hope you mean that figure in Shekels! But 20 nodes with Hadoop can do quite a lot, and you cannot compare a single Oracle box with a 20-node Hadoop cluster, as they serve slightly different use cases. You need to commit to what you want to achieve with HBase, and growth is the most important factor. Scaling Oracle is really expensive, while HBase/Hadoop in comparison is not: its costs grow linearly, whereas with Oracle they grow more exponentially.
Lars

On Mon, Nov 22, 2010 at 1:27 PM, Lior Schachter <[email protected]> wrote:
> Hi all,
>
> Thanks for your input and assistance.
>
> From your answers I understand that:
> 1. More is better, but our configuration might work.
> 2. There are small tweaks we can do that will improve our configuration
>    (like having 4x500GB disks).
> 3. Use monitoring (like Ganglia) to find the bottlenecks.
>
> For me, the question here is how to balance our current budget against
> system stability (and performance).
> I agree that more memory and more disk space will improve our
> responsiveness, but on the other hand our system is NOT expected to be
> real-time (it is rather back-office analytics with a few hours' delay).
>
> This is a crucial point, since the proposed configurations we found on
> the web don't distinguish between real-time and back-office
> configurations. Building a real-time cluster with 20 nodes will cost
> around 200-300K (in Israel); this is similar to the price of a quite
> strong Oracle cluster... so my boss (the CTO) was partially right when
> telling me - "but you said it would be cheap!! Very cheap" :)
>
> I believe that more money will come when we show the viability of the
> system... I also read that heterogeneous clusters are common.
>
> It would help a lot if you could share your configurations and system
> characteristics (maybe on a wiki page).
> It would also help to get more of the "small tweaks" that you found
> helpful.
>
> Lior Schachter
>
> On Mon, Nov 22, 2010 at 1:33 PM, Lars George <[email protected]> wrote:
>> Oleg,
>>
>> Do you have Ganglia or some other graphing tool running against the
>> cluster? It gives you metrics that are crucial here, for example the
>> load on Hadoop and its DataNodes, as well as insertion rates etc. on
>> HBase. What is also interesting is the compaction queue, to see if the
>> cluster is going slow.
>>
>> Did you try loading from an empty system to a loaded one? Or was it
>> already filled and you are trying to add more? Are you spreading the
>> load across servers, or are you using sequential keys that tax only
>> one server at a time? [See the key-salting sketch after the thread.]
>>
>> 16GB should work, but is not ideal. The various daemons simply need
>> room to breathe. That said, I have personally started with even 12GB
>> and it worked.
>>
>> Lars
>>
>> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <[email protected]> wrote:
>>> On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <[email protected]> wrote:
>>>> Oleg & Lior,
>>>>
>>>> A couple of questions and a couple of suggestions to ponder:
>>>> A) When you say 20 name servers, I assume you are talking about 20
>>>> task servers?
>>>
>>> Yes.
>>>
>>>> B) What type are your M/R jobs? Compute-intensive vs. storage-intensive?
>>>
>>> M/R is mostly parsing; only 5-10% of the M/R output is stored in HBase.
>>>
>>>> C) What is your data growth?
>>>
>>> Currently we have 50GB per day; it could grow to ~150GB.
>>>
>>>> D) With the current jobs, are you saturating RAM? CPU? Or storage?
>>>
>>> The map phase runs at 100% CPU, since it is parsing and the input
>>> files are gzipped.
>>> We definitely have memory issues.
>>>
>>>> Ganglia/Hadoop metrics should tell.
>>>> E) Also, are your jobs long-running or short tasks?
>>>
>>> Map tasks take from 5 seconds to 2 minutes.
>>> The reducer (insertion into HBase) takes ~3 hours. [See the
>>> buffered-put sketch after the thread.]
>>>
>>>> Suggestions:
>>>> A) Your name node could be 32 GB, 2 TB disk.
>>>> Make sure it is an
>>>> enterprise-class server, and also back up to an NFS mount.
>>>> B) Also have a decent machine as the checkpoint name node. It could
>>>> be similar to the task nodes.
>>>> C) I assume by "master machine" you mean the JobTracker. It could be
>>>> similar to the TaskTrackers - 16/24 GB memory, with 4-8 TB disk.
>>>> D) As Jean-Daniel pointed out, 500GB disks (with more spindles) are
>>>> what I would also recommend. But it also depends on your primary,
>>>> intermediate, and final data sizes. 1 or 2 TB disks are also fine,
>>>> because they give you more storage. I assume you have the default
>>>> replication of 3.
>>>> E) A dedicated 1Gb network would be good. As there are only ~25
>>>> machines, you can hang them off of a good Gb switch. Consider 10Gb
>>>> in the future if there is too much intermediate data traffic.
>>>>
>>>> Cheers
>>>> <k/>
>>>>
>>>> On Sun, Nov 21, 2010, "Oleg Ruchovets" <[email protected]> wrote:
>>>>> Hi all,
>>>>>
>>>>> After testing HBase for a few months with a very light configuration
>>>>> (5 machines, 2 TB disk, 8 GB RAM), we are now planning for production.
>>>>>
>>>>> Our load:
>>>>> 1) 50GB of log files to process per day by Map/Reduce jobs.
>>>>> 2) Insert 4-5GB into 3 tables in HBase.
>>>>> 3) Run 10-20 scans per day (scanning about 20 regions in a table).
>>>>> All this should run in parallel.
>>>>> Our current configuration can't cope with this load and we are
>>>>> having many stability issues.
>>>>>
>>>>> This is what we have in mind:
>>>>> 1. Master machine - 32 GB, 4 TB, two quad-core CPUs.
>>>>> 2. Name node - 16 GB, 2 TB, two quad-core CPUs.
>>>>> We plan to have up to 20 name servers (starting with 5).
>>>>>
>>>>> We already read
>>>>> http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
>>>>>
>>>>> We would appreciate your feedback on our proposed configuration.
>>>>>
>>>>> Regards,
>>>>> Oleg & Lior
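To Lars's question above about sequential keys taxing one server at a time: a common remedy is to prefix the row key with a small hash-derived bucket so consecutive keys land in different regions. Below is a minimal sketch against the HBase 0.90-era client API; the table name "events", column family "cf", and bucket count are illustrative assumptions, not details from this thread.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedKeyExample {

    // Assumed bucket count; in practice pick roughly the number of
    // region servers so writes spread across all of them.
    private static final int BUCKETS = 16;

    // Prefix a sequential id with a deterministic bucket byte so that
    // consecutive ids hit different regions instead of one hot region.
    static byte[] saltedKey(long sequentialId) {
      byte bucket = (byte) (sequentialId % BUCKETS);
      return Bytes.add(new byte[] { bucket }, Bytes.toBytes(sequentialId));
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "events"); // hypothetical table
      try {
        for (long id = 0; id < 1000; id++) {
          Put put = new Put(saltedKey(id));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"),
                  Bytes.toBytes("value-" + id));
          table.put(put);
        }
      } finally {
        table.close();
      }
    }
  }

The trade-off is on the read side: a scan over a contiguous id range now has to fan out over all 16 bucket prefixes, so salting fits write-heavy, scan-light tables best.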

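On Oleg's note that the reduce phase (insertion into HBase) takes ~3 hours: one cheap tweak from that era is to turn off auto-flush so the client buffers puts and ships them in batches instead of one RPC per row. A minimal sketch, again against the 0.90-era client API; the table name "analytics", the column names, and the buffer size are assumptions.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BufferedInsertExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "analytics"); // hypothetical table
      table.setAutoFlush(false);                 // buffer puts client-side
      table.setWriteBufferSize(4 * 1024 * 1024); // ship ~4 MB batches
      try {
        for (long row = 0; row < 100000; row++) {
          Put put = new Put(Bytes.toBytes(row));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"),
                  Bytes.toBytes(row));
          table.put(put); // queued in the buffer, not sent immediately
        }
      } finally {
        table.flushCommits(); // push whatever is left in the buffer
        table.close();
      }
    }
  }

If buffered puts are still too slow, the usual next steps are pre-splitting the table so inserts do not funnel into one region, or writing HFiles with HFileOutputFormat and bulk-loading them instead of using puts at all.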