Hi Ted,

> You appear to be running on about 10 disks total. Each disk should be
> capable of about 100 ops per second but they appear to be doing about 70.
> This is plausible overhead.
Each c1.xlarge instance has 4 ephemeral disks. However, I forgot to modify my script to mount the other 2 ephemeral disks and add them to dfs.data.dir, so it should actually be running on 20 disks total. That would make it 100 expected ops per second vs. 35 actual ops per second per disk. Is that still plausible overhead? Is there a difference in performance between adding the 4 disks to dfs.data.dir and setting up a RAID-0 of the 4 ephemeral disks with a single location for dfs.data.dir? I'll also try your suggestion of using multiple EBS stores.

> Is your actual load going to be completely uniformly random? Or will there
> be a Zipf distribution? Will there be bursts of repeated accesses?
>
> Uniform random can be a reasonably good approximation if you are running
> behind a cache large enough to cache all repeated accesses. If you aren't
> behind a cache, uniform access might be very unrealistic (and pessimistic).
>
> Do you have logs that you can use to model your actual read behaviors?

Right now, I'm just playing with a completely uniformly random load. However, I have also tried a Zipf distribution, and the throughput seems to saturate at around 1.2k ops per second. I don't actually have logs to model my read behavior, since I'm using HBase as part of a research project.

Thanks,
Harold
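P.S. For reference, this is roughly what I'm thinking of putting in hdfs-site.xml once all four ephemeral disks are mounted. The mount points below (/mnt, /mnt2, /mnt3, /mnt4) are just placeholders for whatever my setup script ends up using:

  <property>
    <name>dfs.data.dir</name>
    <!-- comma-separated list; the DataNode spreads block writes round-robin across these directories -->
    <value>/mnt/hdfs/data,/mnt2/hdfs/data,/mnt3/hdfs/data,/mnt4/hdfs/data</value>
  </property>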
