Yeah, EC2 tends to have both low network and disk throughput. We use it because of the flexibility and because throughput isn't a big concern for us.
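For anyone wanting to sanity-check that on their own instances, a rough sketch of the kind of sequential-write test Lars describes below; the target path and the 1 GB size are placeholders, not from this thread:

```shell
# Rough per-directory sequential-write check; point it at each mounted
# volume you care about. The path and count below are placeholders.
# oflag=direct bypasses the page cache so the reported MB/s reflects
# the disk, not RAM.
dd if=/dev/zero of=/mnt/hdfs/ebs1/ddtest bs=1M count=1024 oflag=direct
rm -f /mnt/hdfs/ebs1/ddtest
```

dd prints the elapsed time and throughput on stderr when it finishes, which is the number to compare across instances and volume types.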
Also, fyi, I'm pretty sure from my tests that EBS is on a separate interface from the regular network (like Phil said), so EBS starving your internode bandwidth shouldn't be a concern.

On Wed, Jan 5, 2011 at 11:16 AM, Lars George <[email protected]> wrote:
> Hi,
>
> I ran some tests on various EC2 clusters from c1.medium and c1.xlarge
> to m2.2xlarge with EBS, on 1+10 instances. The instance storage
> usually averages around 2-3 MB/s for writes, and the EBS-backed
> m2.2xlarge did 7-8 MB/s on writes. Reading I think is less of an
> issue, but writing is really bad. On a dedicated cluster I expect to
> see at least 15 MB/s, and have seen 25 MB/s on quite average
> co-located servers. EBS is better, but still bad for large ETL jobs.
> I had one use-case where a single (!) machine with a single-threaded
> app could do an ETL job in about 50 mins, while an EC2 cluster doing
> the same on 1+10 nodes took 30 hrs! Go figure. And I added a
> "dry-run" switch that would do all the reading and parsing, just no
> writing, and those runs finished in 45 mins. So this was definitely
> write-bound.
>
> One takeaway is to watch for, and expect, a huge deviation in
> performance. And a rule of thumb may be: if you have a well-performing
> EC2 cluster, do not shut it down if you can avoid it. Or spin up a
> few, do a burn-in, and select the fastest.
>
> Lars
>
>
> On Wed, Jan 5, 2011 at 4:50 PM, Matt Corgan <[email protected]> wrote:
> > Hi Otis,
> >
> > I think it might be difficult to interpret the results of running
> > all the different nodes in the same cluster. I would recommend
> > running your test once with N nodes using local disk, then again
> > with N nodes using 1 EBS volume, then again with N nodes using X
> > EBS volumes.
> >
> > Do you know if your workload is most likely to be restricted by
> > CPU, memory, disk throughput, disk space, or disk seeks? EBS helps
> > most with the last 2, but don't overlook how expensive it can be.
> > We're mostly disk-seek limited, so we mount 6x100GB EBS volumes on
> > each m1.large server and don't even use the local disks, in order
> > to keep things simple. If that proves not enough and the servers
> > can still handle it, we'll probably add new servers to the cluster
> > with 12x100GB and then slowly remove the old ones. These are not in
> > a RAID configuration like we do for MySQL, just listed in the
> > hdfs-site.xml file:
> >
> > <property>
> >   <name>dfs.data.dir</name>
> >   <value>/mnt/hdfs/ebs1,/mnt/hdfs/ebs2,/mnt/hdfs/ebs3,/mnt/hdfs/ebs4,/mnt/hdfs/ebs5,/mnt/hdfs/ebs6</value>
> > </property>
> >
> > Hope that helps,
> > Matt
> >
> >
> > On Wed, Jan 5, 2011 at 1:44 AM, Otis Gospodnetic <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I think this bit from Matt and the last bit from Phil about a
> >> drive-per-cpu-core seem like strong arguments in favour of EBS.
> >> I don't have a good feel/experience for speed when the storage
> >> medium is on the other side of a *fibre* link vs. a completely
> >> local disk. The fact that everything is shared, and that the
> >> intensity of its use by others sharing the resources varies, makes
> >> EBS vs. local super hard to compare properly.
> >>
> >> How about doing this to compare performance and cost:
> >> * create N EC2 instances
> >> * on half of them configure Hadoop HDFS/MR to use local disk
> >> * on a quarter of them configure Hadoop HDFS/MR to use 1 EBS volume
> >> * on a quarter of them configure Hadoop HDFS/MR to use N EBS volumes
> >> * run your regular MR jobs
> >> * compare performance
> >> * look at the EBS section on the AWS monthly bill
> >>
> >> Q1: does the above sound good, or is there a way to improve it?
> >> Q2: what's the best way to compare performance of different nodes,
> >> other than manually checking various Hadoop UIs to see how long
> >> Map and Reduce tasks on different nodes *tend* to take?
> >> The above is really more about HDFS/MR performance on local vs.
> >> EBS disks. If each of the above nodes also runs an HBase
> >> RegionServer, how would one see which group of them is the
> >> fastest, and which the slowest? Is there a "rows per second" sort
> >> of metric somewhere that would show how fast different RSs are?
> >>
> >> Thanks,
> >> Otis
> >> ----
> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> >> Lucene ecosystem search :: http://search-lucene.com/
> >>
> >>
> >>
> >> ----- Original Message ----
> >> > From: Matt Corgan <[email protected]>
> >> > To: user <[email protected]>
> >> > Sent: Tue, January 4, 2011 2:36:51 PM
> >> > Subject: Re: HBase / HDFS on EBS?
> >> >
> >> > One nice thing is that you can create many small EBS volumes per
> >> > instance, and since each EBS volume does ~100 IOPS you can get
> >> > really good aggregate random read performance.
> >> >
> >> >
> >> > On Tue, Jan 4, 2011 at 2:05 PM, Phil Whelan <[email protected]> wrote:
> >> >
> >> > > Hi Otis,
> >> > >
> >> > > I have used Hadoop on EBS, but not HBase yet (apologies for
> >> > > not being HBase specific).
> >> > >
> >> > > > * Supposedly ephemeral disks can be faster, but EC2 claims
> >> > > > EBS is faster. People who benchmarked EBS mention its
> >> > > > performance varies a lot. Local disks suffer from the noisy
> >> > > > neighbour problem, no?
> >> > >
> >> > > EBS volumes are much faster than the EC2 instance's local
> >> > > disk, in my experience.
> >> > >
> >> > > > * EBS disks are not local. They are far from the CPU. What
> >> > > > happens with data locality if you have data on EBS?
> >> > >
> >> > > Amazon uses a local *fibre* network to connect EBS to the
> >> > > machine, so that is not much of a problem.
> >> > >
> >> > > > * MR jobs typically read and write a lot. I wonder if this
> >> > > > ends up being very expensive?
> >> > >
> >> > > Costs do tend to creep up on AWS. On the plus side, you can
> >> > > roughly calculate how expensive your MR jobs will be. Using
> >> > > your own hardware is definitely more cost-effective.
> >> > >
> >> > > > * Data on ephemeral disks is lost when an instance
> >> > > > terminates. Do people really rely purely on having N DNs and
> >> > > > a high enough replication factor to prevent data loss?
> >> > >
> >> > > I found local EC2 instance disks far slower than EBS, so I
> >> > > stopped using them. I do not recall losing more than one EBS
> >> > > volume, but I've lost many EC2 instances (and the local disks
> >> > > with them). Now I always choose EBS-backed EC2 instances.
> >> > >
> >> > > > * With EBS you could just create a larger volume when you
> >> > > > need more disk space and attach it to your existing DN. If
> >> > > > you are running out of disk space on local disks, what are
> >> > > > the options? Got to launch more EC2 instances even if all
> >> > > > you need is disk space, not more CPUs?
> >> > >
> >> > > Yes, you cannot increase the local disk space on an EC2
> >> > > instance without getting a larger instance. As I understand
> >> > > it, it is good for Hadoop to have one disk per CPU core for MR.
> >> > >
> >> > > Thanks,
> >> > > Phil
> >> > >
> >> > > --
> >> > > Twitter : http://www.twitter.com/philwhln
> >> > > LinkedIn : http://ca.linkedin.com/in/philwhln
> >> > > Blog : http://www.philwhln.com
> >> > >
> >> > > > Thanks,
> >> > > > Otis
> >> > > > ----
> >> > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> >> > > > Lucene ecosystem search :: http://search-lucene.com/
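As an aside on Matt's six-volume layout: getting from freshly attached EBS volumes to the /mnt/hdfs/ebs1 ... /mnt/hdfs/ebs6 directories listed in his dfs.data.dir might look roughly like the sketch below. The device names are assumptions (they vary with instance type and attachment order), and ext3 is just one plausible filesystem choice, not something the thread specifies.

```shell
# Hypothetical provisioning sketch for Matt's layout: one filesystem
# per EBS volume, no RAID, each mounted as its own HDFS data directory.
# Device names are assumptions -- check what the kernel actually
# assigned (e.g. via /proc/partitions) before running anything.
i=1
for dev in /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk; do
    mkfs.ext3 -q "$dev"
    mkdir -p "/mnt/hdfs/ebs$i"
    mount "$dev" "/mnt/hdfs/ebs$i"
    i=$((i + 1))
done
```

With the volumes mounted this way, the dfs.data.dir value from Matt's hdfs-site.xml snippet above matches the mount points one-to-one.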
