Answers in-line.

On Wed, Jun 1, 2011 at 12:42 AM, Harold Lim <[email protected]> wrote:
> Hi Ted,
>
> > You appear to be running on about 10 disks total. Each disk should be
> > capable of about 100 ops per second but they appear to be doing about
> > 70. This is plausible overhead.
>
> Each c1.xlarge instance has 4 ephemeral disks. However, I forgot to
> modify my script to mount the other 2 ephemeral disks and add them to
> dfs.data.dir. So, it should be running on 20 disks total. That would
> make it 100 ops per second vs. 35 ops per second? Is that still a
> plausible overhead?

Potentially. HBase may need to read several locations to access your data
since it effectively overlays multiple HFiles.

> Is there a difference in performance if I add the 4 disks to
> dfs.data.dir vs. setting up a RAID-0 of the 4 ephemeral disks and having
> a single location for dfs.data.dir?

I would avoid RAID-0 (see the rough dfs.data.dir sketch at the end of this
mail).

> > Uniform random can be a reasonably good approximation if you are
> > running behind a cache large enough to cache all repeated accesses. If
> > you aren't behind a cache, uniform access might be very unrealistic
> > (and pessimistic).
> >
> > Do you have logs that you can use to model your actual read behaviors?
>
> Right now, I'm just playing with a completely uniform random
> distribution. However, I have also tried a Zipf distribution and the
> throughput seems to saturate at around 1.2k ops per second.

Harumph. What about a workload that prefers recently accessed keys?
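To make that concrete, here is a rough sketch (just an illustration in the
spirit of YCSB-style zipfian/latest generators, not taken from any
particular tool; the key-space size and the 0.99 exponent are made-up
parameters) of three key choosers for a read test: uniform, Zipf over key
rank, and a "latest"-style chooser that applies the same Zipf skew to
recency rank so recently written keys stay hot:

    import java.util.Random;

    public class KeyChooser {
        private final int keySpace;      // number of rows loaded (assumed)
        private final double[] zipfCdf;  // precomputed CDF over ranks 1..keySpace
        private final Random rng = new Random();

        public KeyChooser(int keySpace, double exponent) {
            this.keySpace = keySpace;
            double[] w = new double[keySpace];
            double sum = 0.0;
            for (int k = 1; k <= keySpace; k++) {   // weight(rank k) = 1 / k^s
                w[k - 1] = 1.0 / Math.pow(k, exponent);
                sum += w[k - 1];
            }
            zipfCdf = new double[keySpace];
            double acc = 0.0;
            for (int i = 0; i < keySpace; i++) {
                acc += w[i] / sum;
                zipfCdf[i] = acc;
            }
        }

        // Uniform: every key equally likely -- the cache-hostile case.
        public int uniformKey() {
            return rng.nextInt(keySpace);
        }

        // Zipf: a handful of hot ranks get most of the traffic.
        public int zipfKey() {
            double u = rng.nextDouble();
            int lo = 0, hi = keySpace - 1;
            while (lo < hi) {                       // binary search the CDF
                int mid = (lo + hi) >>> 1;
                if (zipfCdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo;
        }

        // "Latest"-style: treat the Zipf rank as a recency rank, so the most
        // recently written keys (near newestKey) are the hottest.
        public int latestKey(int newestKey) {
            return newestKey - (zipfKey() % (newestKey + 1));
        }

        public static void main(String[] args) {
            KeyChooser c = new KeyChooser(1000000, 0.99); // sizes are illustrative
            System.out.println(c.uniformKey() + " " + c.zipfKey()
                    + " " + c.latestKey(999999));
        }
    }

Driving the same table with all three choosers gives a feel for how much
of the 1.2k ops/s ceiling is raw disk seeks versus re-reads that a block
cache could absorb.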

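Going back to the dfs.data.dir question: dfs.data.dir accepts a
comma-separated list of directories and the datanode spreads blocks across
them, so the four ephemeral disks can each be listed individually instead
of being striped into a RAID-0 volume. A minimal hdfs-site.xml sketch,
assuming the disks are mounted at /mnt1 through /mnt4 (the mount points
are placeholders for whatever your startup script actually uses):

    <!-- hdfs-site.xml: one entry per ephemeral disk (paths are examples) -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt1/hdfs/data,/mnt2/hdfs/data,/mnt3/hdfs/data,/mnt4/hdfs/data</value>
    </property>

Compared with RAID-0, keeping the disks independent means one slow or
flaky ephemeral disk doesn't drag every read down to its speed the way a
stripe does.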