Maybe you should give a little more information about your RAID controller (write-back or write-through?) and the underlying filesystem (ext3? block size?).
Very interesting benchmark and discussion, by the way :-)

On Thu, Dec 20, 2012 at 11:07 PM, Jean-Marc Spaggiari <[email protected]> wrote:

I did the test with a 2 GB file... so reads and writes were spread over the 2 drives for RAID0.

Those tests were meant to give an overall idea of the performance vs. CPU usage, etc.; you might need to adjust them based on the way things are configured on your own system.

I don't know how RAID0 manages small files (<= 64 KB), but maybe they are still spread over the 2 disks too?

JM

2012/12/20 Varun Sharma <[email protected]>:

Hmm, I thought that RAID0 simply stripes across all disks. So if you have 4 disks, an HFile block, for example, could get striped across all 4. To read that block you would need all 4 of them to seek, so that you could read all 4 stripes of the HFile block. That could make things as slow as the slowest-seeking disk for that random read. Data transfer would certainly be much faster with RAID0, but since an HFile block is merely 64 KB, I would have expected seek latency to play the major role, not transfer latency.

However, your tests indeed show that RAID0 still outperforms JBOD on seeks. Am I missing something?

On Thu, Dec 20, 2012 at 1:26 PM, Jean-Marc Spaggiari <[email protected]> wrote:

Hi Varun,

The hard drives I used are now in use on the Hadoop/HBase cluster, but they were clean and freshly formatted for the tests I did. The computer where I ran those tests was one of the region servers. It was re-installed to be completely clean, and it is now running a datanode and a RS.

Regarding RAID, I think you are confusing RAID0 and RAID1. It's RAID1 that needs to access the 2 copies each time. RAID0 is more like JBOD, but faster.

JM
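To make the striping question above concrete, here is a minimal Python sketch of how a RAID0 layer maps a single read onto its member disks. The chunk size and disk count are made-up example values, not anything from JM's setup; with a typical chunk size of 64 KB or larger, a 64 KB read touches at most two disks, and usually just one, so every disk does not have to seek for every HFile-sized block.

    # Minimal RAID0 chunk-mapping sketch. Chunk size and disk count are
    # hypothetical example values; real arrays depend on how they were built.
    def disks_touched(offset, length, chunk_size, n_disks):
        """Return the set of member disks a read of `length` bytes
        starting at byte `offset` must touch in a RAID0 array."""
        first = offset // chunk_size
        last = (offset + length - 1) // chunk_size
        return {chunk % n_disks for chunk in range(first, last + 1)}

    # A 64 KB read with a 256 KB chunk spans at most 2 chunks:
    print(disks_touched(offset=1_000_000, length=64 * 1024,
                        chunk_size=256 * 1024, n_disks=4))   # -> {0, 3}

This is part of why RAID0 can still win on random reads: most small reads cost one seek on one disk, while the other spindles stay free to serve other requests.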
2012/12/20 Varun Sharma <[email protected]>:

Hi Jean,

Very interesting benchmark - how were these numbers arrived at? Is this on a real HBase cluster? To me it felt kind of counter-intuitive that RAID0 beats JBOD on random seeks, because with RAID0 all disks need to seek at the same time, so performance should basically be as bad as the slowest-seeking disk.

Varun

On Wed, Dec 19, 2012 at 5:14 PM, Michael Segel <[email protected]> wrote:

Yeah, I couldn't argue against LVMs when talking with the system admins. In terms of speed it's noise, because the CPUs are pretty efficient, and unless you have more than 1 drive per physical core you will end up saturating your disk I/O.

In terms of MapR, you want the raw disk. (But we're talking Apache.)

On Dec 19, 2012, at 4:59 PM, Jean-Marc Spaggiari <[email protected]> wrote:

Finally! It took me a while to run those tests because they took way longer than expected, but here are the results:

http://www.spaggiari.org/bonnie.html

LVM is not really slower than JBOD and does not really take more CPU. So I will say: if you have to choose between the two, take the one you prefer. Personally, I prefer LVM because it's easy to configure.

The big winner here is RAID0. It's WAY faster than anything else. But it's using twice the space... your choice.

I did not get a chance to test with the Ubuntu tool because it's not working with LVM drives.

JM
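For anyone who wants to reproduce a crude version of the random-read side of such a benchmark, here is a small self-contained Python sketch. It is an illustration, not bonnie++'s methodology; the path is a placeholder, and the test file should be much larger than RAM so the page cache doesn't hide the disk.

    # Crude random-read latency probe (Unix only; os.pread needs Python 3.3+).
    import os, random, time

    PATH = "/mnt/disk1/testfile"   # placeholder; use a file bigger than RAM
    BLOCK = 64 * 1024              # read size, matching an HFile block
    SEEKS = 1000

    size = os.path.getsize(PATH)
    fd = os.open(PATH, os.O_RDONLY)
    start = time.time()
    for _ in range(SEEKS):
        os.pread(fd, BLOCK, random.randrange(0, size - BLOCK))
    elapsed = time.time() - start
    os.close(fd)
    print("%.2f ms/read, %.0f reads/s" % (1000 * elapsed / SEEKS, SEEKS / elapsed))

Run it once per layout (JBOD drive, LVM volume, RAID0 array) against the same size of test file and compare.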
2012/11/28, Michael Segel <[email protected]>:

Ok, just a caveat.

I am discussing MapR as part of a complete response. As Mohit posted, MapR takes the raw device for the MapR file system. They do stripe on their own within what they call a volume.

But going back to Apache... You can stripe drives; however, I wouldn't recommend it. I don't think the performance gains would really matter. You're going to get blocked first by disk I/O, then your controller card, then your network... assuming 10GbE.

With only 2 disks on an 8-core system, you will hit disk I/O first, and then you'll watch your CPU I/O wait climb.

HTH

-Mike

On Nov 28, 2012, at 7:28 PM, Jean-Marc Spaggiari <[email protected]> wrote:

Hi Mike,

Why not use LVM with MapR? Since LVM reads from 2 drives almost at the same time, shouldn't it be better than RAID0 or a single drive?

2012/11/28, Michael Segel <[email protected]>:

Just a couple of things.

I'm neutral on the use of LVMs. Some would point out that there's some overhead, but on the flip side it can make managing the machines easier. If you're using MapR, you don't want LVMs but raw devices.

In terms of GC, it's going to depend on the heap size, not the total memory. With respect to HBase... MSLAB is the way to go.

On Nov 28, 2012, at 12:05 PM, Jean-Marc Spaggiari <[email protected]> wrote:

Hi Gregory,

I found this about LVM:
-> http://blog.andrew.net.au/2006/08/09
-> http://www.phoronix.com/scan.php?page=article&item=fedora_15_lvm&num=2

It seems that performance is still correct with it. I will most probably give it a try and bench that too... I have one new hard drive which should arrive tomorrow. Perfect timing ;)

JM

2012/11/28, Mohit Anchlia <[email protected]>:

On Nov 28, 2012, at 9:07 AM, Adrien Mogenet <[email protected]> wrote:

> Does HBase really benefit from 64 GB of RAM, since allocating too large a heap might increase GC time?

The benefit you get is from the OS cache.

> Another question: why not RAID 0, in order to aggregate disk bandwidth (and thus keep the 3x replication factor)?
On Wed, Nov 28, 2012 at 5:58 PM, Michael Segel <[email protected]> wrote:

Sorry, I need to clarify.

4 GB per physical core is a good starting point. So with 2 quad-core chips, that is going to be 32 GB.

IMHO that's a minimum. If you go with HBase, you will want more. (Actually, you will need more.) The next logical jump would be to 48 or 64 GB.

If we start to price out memory, then depending on the vendor and your company's procurement, there really isn't much of a price difference between 32, 48, or 64 GB. Note that it also depends on the chips themselves. You also need to see how many memory channels exist on the motherboard; you may need to buy in pairs or triplets. Your hardware vendor can help you. (Also keep an eye on your hardware vendor: sometimes they will give you higher-density chips that are going to be more expensive...) ;-)

I tend to like having extra memory from the start. It gives you a bit more freedom and also protects you from 'fat' code.

Looking at YARN... you will need more memory too.

With respect to the hard drives...

The best recommendation is to keep the drives as JBOD and then use 3x replication. In this case, make sure that the disk controller cards can handle JBOD. (Some don't support JBOD out of the box.)

With respect to RAID...

If you are running MapR, there's no need for RAID. If you are running an Apache derivative, you could use RAID 1 and then cut your replication to 2x. This makes it easier to manage drive failures. (It's not the norm, but it works...) In some clusters, they are using appliances like NetApp's E-Series, where the machines see the drives as local attached storage, and I think the appliances themselves are using RAID. I haven't played with this configuration; however, it could make sense, and it's a valid design.

HTH

-Mike
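To put numbers on the JBOD-plus-3x versus RAID1-plus-2x trade-off Mike describes, here is a back-of-the-envelope sketch; the disk count and size are made-up examples, not anything from the thread's hardware.

    # Usable-capacity comparison: JBOD + 3x replication vs RAID1 + 2x.
    disks, disk_tb = 8, 2.0            # hypothetical node: 8 x 2 TB drives
    raw_tb = disks * disk_tb

    jbod_usable = raw_tb / 3           # full raw capacity, 3 HDFS copies
    raid1_usable = (raw_tb / 2) / 2    # mirroring halves capacity, then 2 copies

    print("raw: %.0f TB" % raw_tb)                      # 16 TB
    print("JBOD + 3x : %.2f TB usable" % jbod_usable)   # 5.33 TB, 3 disk copies
    print("RAID1 + 2x: %.2f TB usable" % raid1_usable)  # 4.00 TB, 4 disk copies

Same hardware, less usable space with RAID1, but a failed drive is handled by the controller instead of triggering HDFS re-replication, which is the manageability win being traded for.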
On Nov 28, 2012, at 10:33 AM, Jean-Marc Spaggiari <[email protected]> wrote:

Hi Mike,

Thanks for all those details!

So to simplify the equation: for 16 virtual cores we need 48 to 64 GB, which means 3 to 4 GB per core. So with quad cores, 12 GB to 16 GB is a good start? Or did I simplify it too much?

Regarding the hard drives: if you add more than one drive, do you need to build them into RAID or a similar system? Or can Hadoop/HBase be configured to use more than one drive?

Thanks,

JM

2012/11/27, Michael Segel <[email protected]>:

OK... I don't know why Cloudera is so hung up on 32 GB. ;-) [It's an inside joke...]

So here's the problem...

By default, your child processes in a map/reduce job get 512 MB. The majority of the time, this gets raised to 1 GB.

8 cores (dual quad-core) show up as 16 virtual processors in Linux. (Note: this is why, when people talk about the number of cores, you have to specify physical cores or logical cores...)

So if you were to oversubscribe and have, let's say, 12 mappers and 12 reducers, that's 24 slots, which means you would need 24 GB of memory reserved just for the child processes. This would leave 8 GB for the DN, the TT, and the rest of the Linux OS processes.

Can you live with that? Sure. Now add in R, HBase, Impala, or some other set of tools on top of the cluster.

Oops! Now you are in trouble, because you will swap. Also, adding in R, you may want to bump those child procs up from 1 GB to 2 GB. That means the 24 slots would now require 48 GB. Now you have swap, and if that happens you will see HBase in a cascading failure.

So while you can do a rolling restart with a changed configuration (reducing the number of mappers and reducers), you end up with fewer slots, which means longer run times for your jobs. (Fewer slots == less parallelism.)

Looking at the price of memory... you can get 48 GB or even 64 GB for around the same price point (8 GB chips).

And I didn't even talk about adding Solr, again a memory hog... ;-)

Note that I matched the number of mappers with reducers. You could go with fewer reducers if you want. I tend to recommend a ratio of 2:1 mappers to reducers, depending on the workflow...

As to the disks... no, 7200 RPM SATA III drives are fine. The SATA III interface is pretty much standard in the new kit being shipped. It's just that you don't have enough drives: 8 cores should mean 8 spindles, if available. Otherwise you end up seeing your CPU load climb on wait states as the processes wait for the disk I/O to catch up.

I mean, you could build out a cluster with 4 x 3.5" 2 TB drives in a 1U chassis based on price. You're making a trade-off, and you should be aware of the performance hit you will take.

HTH

-Mike
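Mike's arithmetic, written out as a sketch you can adapt (all numbers are the example values from his mail, not recommendations):

    # Back-of-the-envelope slot memory budget from the thread's example.
    total_gb = 32
    mappers, reducers = 12, 12
    child_heap_gb = 1                    # default is 512 MB, often raised to 1 GB

    slots = mappers + reducers           # 24
    children_gb = slots * child_heap_gb  # 24 GB just for child JVMs
    left_gb = total_gb - children_gb     # 8 GB for DN, TT, OS... and HBase, R, etc.

    print("%d slots x %d GB = %d GB; %d GB left over" %
          (slots, child_heap_gb, children_gb, left_gb))
    print("at 2 GB per child: %d GB for child JVMs alone" % (slots * 2))

Bump the child heap to 2 GB for R and the same 24 slots need 48 GB, which is exactly where the swapping, and then the HBase cascading failure, comes from.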
On Nov 27, 2012, at 1:52 PM, Jean-Marc Spaggiari <[email protected]> wrote:

Hi Michael,

So are you recommending 32 GB per node?

What about the disks? Are SATA drives too slow?

JM

2012/11/26, Michael Segel <[email protected]>:

Uhm, those specs are actually now out of date.

If you're running HBase, or want to also run R on top of Hadoop, you will need to add more memory. Also, forget 1GbE, get 10GbE; and with 2 SATA drives you will be disk I/O bound way too quickly.

On Nov 26, 2012, at 8:05 AM, Marcos Ortiz <[email protected]> wrote:

Are you asking about hardware recommendations? Eric Sammer, in his "Hadoop Operations" book, did a great job on this. For mid-size clusters (up to 300 nodes):

Processor: dual quad-core, 2.6 GHz
RAM: 24 GB DDR3
Dual 1 Gb Ethernet NICs
A SAS drive controller
At least two SATA II drives in a JBOD configuration

The replication factor depends heavily on the primary use of your cluster.

On 11/26/2012 08:53 AM, David Charle wrote:

Hi,

What are the recommended node counts for the NN, HMaster, and ZK for a larger cluster, let's say 50-100+ nodes?

Also, what would be the ideal replication factor for larger clusters when you have 3-4 racks?

--
David

--
Marcos Luis Ortíz Valmaseda
about.me/marcosortiz <http://about.me/marcosortiz>
@marcosluis2186 <http://twitter.com/marcosluis2186>

--
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me
