Ah, that was you :)

You can find the documentation at: https://github.com/hortonworks/hoya/tree/master/src/site/markdown, specifically you'd be interested in https://github.com/hortonworks/hoya/blob/master/src/site/markdown/building.md. I'll try to see if I can get the documentation links fixed.

HOYA uses Hadoop's YARN to perform this provisioning. It uses HDFS as shared storage and leverages the YARN APIs to run across a cluster. In practice, it shouldn't matter whether you're running on bare metal or on virtualized hosts.

On 11/6/13, 1:41 PM, Kesten Broughton wrote:
Thanks a lot for the quick responses.

So donald, if I get you correctly, you recommend against a hybrid approach, but 
if multi-tenancy and resource utilization are big factors (they are) then a 
pure virtualized approach might be appropriate?  It's just a far less trodden 
path.  We are working towards an openstack environment, which may help with the 
networking configuration component.

Hoya looks interesting, but unfortunately all the links are currently 404 
landspeeder (I submitted an issue).
As far as utilization goes, is Hoya PXE/Cobbler-booting bare metal from a 
bare-metal resource pool?   That would certainly be much slower, but might be 
suitable.
Or would we have to load our bare-metal pool with all the resources in our 
stack, then remove nodes from one thing (say, an Elasticsearch cluster) and 
add them to Accumulo?
That might be sane.

This is good food for thought as we consider our options.

kesten
________________________________________
From: Donald Miner [[email protected]]
Sent: Tuesday, November 05, 2013 2:45 PM
To: [email protected]
Subject: Re: virtualize accumulo?

I think a hybrid approach is probably more pain than it's worth. The 
configuration of the networking and the IP addresses across virtual and 
physical hosts will be challenging but not impossible. Also, what are you 
trying to isolate Accumulo from? MapReduce perhaps? A large Storm instance? 
Either way, you'll have to think about how to virtualize and provision those 
things, too. Now your host is dealing with VMs and HDFS services. None of these 
are really show-stopper excuses, so you really could do what you are trying to 
do, but you'd be paving your own way.

I'm pretty sure I agree with Josh on this one, but wanted to explain the pure 
virtualization option.

The VMWare thing you mentioned might have been this thing:
http://www.vmware.com/products/big-data-extensions/features.html (marketing)
http://www.vmware.com/files/pdf/Hadoop-Virtualization-Extensions-on-VMware-vSphere-5.pdf
 (more technical, but less breadth)

I'm a big proponent of these as they really do solve a couple of fundamental problems (disclaimer: I 
used to work for Pivotal, who helped push this solution). The neat thing they added in the 
extensions was the understanding of data locality between TaskTrackers and DataNodes that reside 
on the same physical host in different virtual machines. This means that jobs get assigned to 
TTs within the same "node group", which is nice for a couple of reasons. Most prominently, 
it allows you to separate the HDFS and MR services into different VMs while maintaining data 
locality. This is good for scaling compute separately from storage, particularly in a multi-tenant 
environment. Another cool thing is you can "shut off" the execution environment: spin 
down the VMs with the TTs but leave the DNs alone. There are some other things they did to make 
this architecture make more sense.
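A minimal sketch of the node-group idea described above (all names and data structures here are my own illustrations, not VMware's actual scheduler or API):

```python
# Toy model of node-group-aware task placement: a TaskTracker VM and a
# DataNode VM on the same physical host share a "node group", and the
# scheduler prefers a TT whose group holds a replica of the block.
# Structures and names are hypothetical, for illustration only.

def pick_tracker(block_replicas, tracker_groups, datanode_groups):
    """Return a TaskTracker name for a task reading the given block.

    block_replicas:   DataNode names holding a replica of the block.
    tracker_groups:   TaskTracker name -> node group (physical host).
    datanode_groups:  DataNode name -> node group (physical host).
    """
    local = {datanode_groups[dn] for dn in block_replicas}
    # Prefer a co-located TaskTracker: same physical host as a replica,
    # so the read stays local even though TT and DN are separate VMs.
    for tt, group in tracker_groups.items():
        if group in local:
            return tt
    # Otherwise fall back to any tracker (read goes over the network).
    return next(iter(tracker_groups))
```

For example, with `tracker_groups = {"tt1": "hostA", "tt2": "hostB"}` and the only replica living on a DataNode in `hostB`'s group, the task lands on `tt2`; with no co-located tracker, it falls back to a remote one.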

So getting back to your question, hypothetically, you could have multiple HDFS 
instances on the same cluster (neat), each supporting one or more Accumulo 
instances, each of which can be handled independently of one another. Your MR 
and other things can also use VMs, and you get pretty good compartmentalization 
of resource utilization. This would give you multi-tenancy and would allow you 
to manage separate services running over HDFS as separate clusters. You could 
also stop region servers while keeping HDFS (and perhaps MapReduce) alive, 
which could be interesting if you want to stand up a proof of concept but don't 
need the service to be live all the time.

In that VMWare paper they mention that performance actually increases with this 
DN/TT separation scheme over bare metal, but be wary of the numbers. There is 
no doubt overhead in having a virtualization layer. But if multi-tenancy and 
elasticity are important to you, this could be one way to make that tradeoff.

-Don




On Tue, Nov 5, 2013 at 3:31 PM, Josh Elser 
<[email protected]<mailto:[email protected]>> wrote:
Hi Kesten,

As you likely know (given your arguments against), applying virtualization to a Hadoop stack 
can introduce some unintended consequences. Hadoop uses a lot of heartbeats between 
processes to determine system "aliveness". If your infrastructure is 
overloaded, Hadoop can really suffer from spikes in latency.

Accumulo is much the same way, arguably a bit more so. Accumulo's processes 
depend on maintaining a lock in ZooKeeper (every 30 seconds by default) rather 
than on RPC calls between DataNodes and NameNodes. Node failure in Accumulo 
tends to be much more expensive than in HDFS because Accumulo wants to make 
sure every tablet is available without significant downtime, whereas Hadoop has 
multiple replicas of each file and can be a bit lazier about noticing failure 
and re-replicating. What I've typically heard is that running Accumulo in a 
virtualized environment makes administration and use a bit more difficult.
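To make the latency-spike point concrete, here's a toy model of ZooKeeper-style session liveness (values and names are illustrative only, not Accumulo's or ZooKeeper's actual implementation):

```python
# A session stays alive only while heartbeats keep arriving within the
# session timeout. A stall on an overloaded hypervisor can push a
# heartbeat past the deadline; the ephemeral lock is then lost and the
# server holding it is declared dead. All values here are illustrative.

SESSION_TIMEOUT = 30.0  # seconds; the default mentioned above

def session_alive(heartbeat_times, now, timeout=SESSION_TIMEOUT):
    """heartbeat_times: timestamps at which heartbeats reached the server."""
    seen = [t for t in heartbeat_times if t <= now]
    return bool(seen) and (now - max(seen)) <= timeout

# Steady 10-second heartbeats: the session is alive at t=35.
print(session_alive([0, 10, 20, 30], 35))   # True
# A VM stall delays the t=30 heartbeat until t=65; by t=62 nothing has
# arrived since t=20, so the session (and the lock) is gone.
print(session_alive([0, 10, 20, 65], 62))   # False
```

The point is that the failure isn't in Accumulo at all: a latency spike alone is enough to miss the deadline and trigger an expensive tablet reassignment.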

If you're considering running HDFS on bare metal, I would encourage you to do 
the same with Accumulo, or investigate something like YARN (really, HOYA: 
https://github.com/hortonworks/hoya/) to do dynamic provisioning. Accumulo can 
happily scale and run across many nodes, so you shouldn't have to worry about 
large installation problems (in other words: one Accumulo instance should be 
sufficient for a cluster). YARN/HOYA gives you dynamic allocation on top of 
your cluster, with the ease of spinning Accumulo clusters up and down as you 
want/need them.


On 11/5/13, 3:21 PM, Kesten Broughton wrote:
I've seen arguments both for and against virtualizing hadoop/hdfs.
(the arguments for were from vmware :)

We are considering hdfs on baremetal, with accumulo being virtualized.
This would serve a fairly constant amount of data but widely varying compute 
demands.
Has anyone tried this?  Can anyone share their experience with 
baremetal/virtualization with accumulo?

thanks

kesten
(first post)


