Hi Lars,

I agree with every sentence you wrote (and that's why we chose HBase). However, from a managerial point of view, the question of the initial investment is very important (especially when considering a new technology).

Lior

P.S. The price is in USD...

On Mon, Nov 22, 2010 at 2:43 PM, Lars George <[email protected]> wrote:
> Hi Lior,
>
> I can only hope you state this in shekels! But 20 nodes with Hadoop
> can do quite a lot, and you cannot compare a single Oracle box with a
> 20-node Hadoop cluster, as they serve slightly different use cases. You
> need to make a commitment to what you want to achieve with HBase, and
> growth is the most important factor. Scaling Oracle is really
> expensive, while HBase/Hadoop is not in comparison: its costs are
> linear, while with Oracle they are more exponential.
>
> Lars
>
> On Mon, Nov 22, 2010 at 1:27 PM, Lior Schachter <[email protected]> wrote:
> > Hi all, Thanks for your input and assistance.
> >
> > From your answers I understand that:
> > 1. more is better but our configuration might work.
> > 2. there are small tweaks we can do that will improve our configuration
> > (like having 4x500GB disks).
> > 3. use monitoring (like Ganglia) to find the bottlenecks.
> >
> > For me, the question here is how to balance between our current budget and
> > system stability (and performance).
> > I agree that more memory and more disk space will improve our responsiveness,
> > but on the other hand our system is NOT expected to be real-time (but rather
> > a back-office analytics system with a few hours' delay).
> >
> > This is a crucial point, since the proposed configurations we found on the
> > web don't distinguish between real-time configurations and back-office
> > configurations. To build a real-time cluster with 20 nodes will cost around
> > 200-300K (in Israel); this is similar to the price of a quite strong Oracle
> > cluster... so my boss (the CTO) was partially right when telling me - but
> > you said it would be cheap !! very cheap :)
> >
> > I believe that more money will come when we show the viability of the
> > system... I also read that heterogeneous clusters are common.
> >
> > It will help a lot if you can provide your configurations and system
> > characteristics (maybe in a Wiki page).
> > It will also help to get more of the "small tweaks" that you found helpful.
> >
> > Lior Schachter
> >
> > On Mon, Nov 22, 2010 at 1:33 PM, Lars George <[email protected]> wrote:
> >
> >> Oleg,
> >>
> >> Do you have Ganglia or some other graphing tool running against the
> >> cluster? It gives you metrics that are crucial here, for example the
> >> load on Hadoop and its DataNodes as well as insertion rates etc. on
> >> HBase. What is also interesting is the compaction queue to see if the
> >> cluster is going slow.
> >>
> >> Did you try loading from an empty system to a loaded one? Or was it
> >> already filled and you are trying to add more? Are you spreading the
> >> load across servers or are you using sequential keys that tax only one
> >> server at a time?
> >>
> >> 16GB should work, but is not ideal. The various daemons simply need
> >> room to breathe. But that said, I have personally started with 12GB
> >> even and it worked.
> >>
> >> Lars
> >>
> >> On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets <[email protected]> wrote:
> >> > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar <[email protected]> wrote:
> >> >
> >> >> Oleg & Lior,
> >> >>
> >> >> Couple of questions & couple of suggestions to ponder:
> >> >> A) When you say 20 Name Servers, I assume you are talking about 20 Task
> >> >> Servers
> >> >>
> >> >
> >> > Yes
> >> >
> >> >> B) What type are your M/R jobs ? Compute Intensive vs. storage intensive ?
> >> >>
> >> >
> >> > M/R: most of it is parsing; 5% - 10% of the M/R result is stored
> >> > to HBase.
> >> >
> >> >> C) What is your Data growth ?
> >> >>
> >> >
> >> > Currently we have 50GB per day; it could be ~150GB.
> >> >
> >> >> D) With the current jobs, are you saturating RAM ? CPU ? Or storage ?
> >> >>
> >> > The map phase takes 100% CPU, since it is parsing and the input
> >> > files are gz.
> >> > We definitely have memory issues.
> >> >
> >> >> Ganglia/Hadoop metrics should tell.
> >> >> E) Also are your jobs long running or short tasks ?
> >> >>
> >> > Map tasks take from 5 seconds to 2 minutes.
> >> > The reducer (insertion to HBase) takes ~3 hours.
> >> >
> >> >> Suggestions:
> >> >> A) Your name node could be 32 GB, 2TB Disk. Make sure it is an enterprise
> >> >> class server and also backup to an NFS mount.
> >> >> B) Also have a decent machine as the checkpoint name node. It could be
> >> >> similar to the task nodes
> >> >> B) I assume by Master Machine, you mean Job Tracker. It could be similar
> >> >> to the Task Trackers - 16/24 GB memory, with 4-8 TB disk
> >> >> C) As Jean-Daniel pointed out 500GB (with more spindles) is what I would
> >> >> also recommend. But it also depends on your primary data, intermediate
> >> >> data and final data size. 1 or 2 TB disks are also fine, because they give
> >> >> you more storage. I assume you have the default replication of 3
> >> >> D) A 1Gb dedicated network would be good. As there are only ~25 machines,
> >> >> you can hang them off of a good Gb switch. Consider 10Gb if there is too
> >> >> much intermediate data traffic, in the future.
> >> >> Cheers
> >> >> <k/>
> >> >>
> >> >> On 11/21/10 Sun Nov 21, 10, "Oleg Ruchovets" <[email protected]> wrote:
> >> >>
> >> >> >Hi all,
> >> >> >After testing HBase for a few months with very light configurations (5
> >> >> >machines, 2 TB disk, 8 GB RAM), we are now planning for production.
> >> >> >Our Load -
> >> >> >1) 50GB log files to process per day by Map/Reduce jobs.
> >> >> >2) Insert 4-5GB to 3 tables in hbase.
> >> >> >3) Run 10-20 scans per day (scanning about 20 regions in a table).
> >> >> >All this should run in parallel.
> >> >> >Our current configuration can't cope with this load and we are having many
> >> >> >stability issues.
> >> >> >
> >> >> >This is what we have in mind :
> >> >> >1. Master machine - 32 GB, 4 TB, Two quad core CPUs.
> >> >> >2. Name node - 16 GB, 2TB, Two quad core CPUs.
> >> >> >we plan to have up to 20 name servers (starting with 5).
> >> >> >
> >> >> >We already read
> >> >> >http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> >> >> >.
> >> >> >
> >> >> >We would appreciate your feedback on our proposed configuration.
> >> >> >
> >> >> >Regards Oleg & Lior
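
For the budget discussion, it may help to put rough numbers on how long a given cluster can hold the data. The sketch below is only back-of-the-envelope arithmetic, not a sizing tool: it takes the figures mentioned in the thread (50 GB/day now, possibly ~150 GB/day, replication of 3, 5 or 20 nodes with 2 TB each) and assumes, purely for illustration, that about 75% of raw disk is usable for HDFS data. The function name and the `usable_fraction` parameter are hypothetical, and compression, intermediate M/R data and HBase compaction overhead are ignored.

```python
def days_until_full(daily_ingest_gb, nodes, disk_per_node_gb,
                    replication=3, usable_fraction=0.75):
    """Estimate how many days of ingest fit on the cluster.

    Assumptions (not from the thread): usable_fraction of the raw disk
    is available for HDFS blocks, and nothing is compressed or deleted.
    """
    if daily_ingest_gb <= 0 or nodes <= 0 or disk_per_node_gb <= 0:
        raise ValueError("ingest, node count and disk size must be positive")
    # Raw capacity reduced by the space kept free for the OS, logs, temp data.
    usable_gb = nodes * disk_per_node_gb * usable_fraction
    # Every ingested GB is stored `replication` times in HDFS.
    stored_per_day_gb = daily_ingest_gb * replication
    return usable_gb / stored_per_day_gb


if __name__ == "__main__":
    # Daily volumes and node counts quoted in the thread.
    for daily_gb in (50, 150):
        for nodes in (5, 20):
            days = days_until_full(daily_gb, nodes, disk_per_node_gb=2000)
            print(f"{daily_gb} GB/day on {nodes} nodes: ~{days:.0f} days")
```

Under these assumptions, 20 nodes at 50 GB/day last about 200 days, while 5 nodes at 150 GB/day fill up in under three weeks, which is one way to show why growth, rather than the initial load, drives the hardware budget.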
