Hi Kesten,

As you likely know (given the arguments against that you've seen), adding virtualization to a Hadoop stack can have some unintended consequences. Hadoop relies on a lot of heartbeats between processes to determine system "aliveness". If your infrastructure is overloaded, spikes in latency can delay those heartbeats, and Hadoop can really suffer as a result.

Accumulo is much the same way, arguably a bit more so. Rather than relying on RPC calls like those between DataNodes and NameNodes, Accumulo's processes depend heavily on maintaining a lock in ZooKeeper (with a 30-second timeout by default). Node failure also tends to be much more expensive in Accumulo than in HDFS, because Accumulo wants to make sure every tablet is available without significant downtime. Hadoop keeps multiple replicas of each file, so it can be a bit lazier about noticing failure and re-replicating. What I've typically heard is that running Accumulo in a virtualized environment makes administration and use a bit more difficult.
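
To make the timeout concern concrete, here is a toy sketch in Python. It is only a simplified model of a session timeout, not ZooKeeper's actual protocol or any Accumulo code; the function name session_survives and the heartbeat timestamps are made up for illustration, and the 30-second default is the only value taken from above.

```python
def session_survives(heartbeat_times, timeout_s=30.0):
    """Toy model: the session (and the lock tied to it) survives only if
    no gap between consecutive heartbeats exceeds the timeout.

    heartbeat_times: ascending timestamps in seconds (hypothetical data).
    timeout_s: assumed session timeout, 30 seconds as mentioned above.
    """
    gaps = (
        later - earlier
        for earlier, later in zip(heartbeat_times, heartbeat_times[1:])
    )
    return all(gap <= timeout_s for gap in gaps)


# Heartbeats arriving every 10 seconds on a healthy host.
healthy = [0, 10, 20, 30, 40]

# Same process, but an overloaded host stalls it for 35 seconds.
stalled = [0, 10, 45, 55]

print(session_survives(healthy))  # True
print(session_survives(stalled))  # False
```

The point is just that a single stall longer than the timeout is enough to lose the lock, even if the process itself is otherwise perfectly healthy.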

If you're considering running HDFS on baremetal, I would encourage you to do the same with Accumulo, or to investigate something like YARN (really, HOYA: https://github.com/hortonworks/hoya/) for dynamic provisioning. Accumulo scales happily across many nodes, so you shouldn't have to worry about problems with a large installation (in other words: one Accumulo instance should be sufficient for a cluster). YARN/HOYA gives you dynamic allocation on top of your cluster, so you can easily spin Accumulo clusters up and down as you want/need them.

On 11/5/13, 3:21 PM, Kesten Broughton wrote:
I've seen arguments both for and against virtualizing hadoop/hdfs.
(the arguments for were from vmware :)

We are considering hdfs on baremetal, with accumulo being virtualized.
This would serve a fairly constant amount of data but widely varying compute 
demands.
Has anyone tried this?  Can anyone share their experience with 
baremetal/virtualization with accumulo?

thanks

kesten
(first post)
