Ah, that was you :)
You can find the documentation at:
https://github.com/hortonworks/hoya/tree/master/src/site/markdown,
specifically you'd be interested in
https://github.com/hortonworks/hoya/blob/master/src/site/markdown/building.md.
I'll try to see if I can get the documentation links fixed.
HOYA uses Hadoop's YARN to perform this provisioning. It uses HDFS as
shared storage, and then leverages the YARN APIs for running across
a cluster. In actuality, it wouldn't matter whether you're running on
bare metal or on virtualized hosts.
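To make that concrete, here is a toy sketch (in Python, purely illustrative; the real YARN client APIs are Java and far richer) of what container-based provisioning amounts to: a scheduler hands out containers from whatever nodes advertise capacity, with no notion of whether a node is physical or virtual. The node names and sizes are made up.

```python
# Toy model of YARN-style container allocation (illustrative only).
# The scheduler sees only advertised capacity, so bare-metal hosts and
# VMs are interchangeable from its point of view.

def allocate(nodes, requests):
    """Greedily place container requests onto nodes with free memory.

    nodes:    dict of node name -> free memory in MB (mutated in place)
    requests: list of (app_name, memory_mb) tuples
    Returns a list of (app_name, node_name) placements.
    """
    placements = []
    for app, mem in requests:
        # Pick any node with enough headroom; "node" may be metal or a VM.
        for node, free in nodes.items():
            if free >= mem:
                nodes[node] = free - mem
                placements.append((app, node))
                break
    return placements

cluster = {"metal01": 8192, "vm-a": 4096, "vm-b": 4096}
print(allocate(cluster, [("accumulo-tserver", 4096), ("accumulo-master", 2048)]))
```

The point of the sketch is only that the allocation logic never inspects what kind of host backs a node, which is why it wouldn't matter whether you run on bare metal or virtualized hosts.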
On 11/6/13, 1:41 PM, Kesten Broughton wrote:
Thanks a lot for the quick responses.
So Donald, if I get you correctly, you recommend against a hybrid approach, but
if multi-tenancy and resource utilization are big factors (they are) then a
pure virtualized approach might be appropriate? It's just a far less trodden
path. We are working towards an openstack environment, which may help with the
networking configuration component.
Hoya looks interesting, but unfortunately all the links are currently 404
landspeeder (I submitted an issue).
As far as utilization goes, is Hoya PXE/Cobbler-booting bare metal from a
bare-metal resource pool? That would certainly be much slower, but might be
suitable.
Or would we have to load our bare-metal pool with all the resources in our
stack, removing a node from one service (an Elasticsearch cluster, say) and
adding it to Accumulo?
That might be sane.
This is good food for thought as we consider our options.
kesten
________________________________________
From: Donald Miner [[email protected]]
Sent: Tuesday, November 05, 2013 2:45 PM
To: [email protected]
Subject: Re: virtualize accumulo?
I think a hybrid approach is probably more pain than it's worth. The
configuration of the networking and the IP addresses across virtual and
physical hosts will be challenging but not impossible. Also, what are you
trying to isolate accumulo from? MapReduce perhaps? A large storm instance?
Either way, you'll have to think about how to virtualize and provision those
things, too. Now your host is dealing with VMs and HDFS services. None of these
are really show-stopper excuses, so you really could do what you are trying to
do, but you'd be paving your own way.
I'm pretty sure I agree with Josh on this one, but wanted to explain the pure
virtualization option.
The VMware thing you mentioned might have been this:
http://www.vmware.com/products/big-data-extensions/features.html (marketing)
http://www.vmware.com/files/pdf/Hadoop-Virtualization-Extensions-on-VMware-vSphere-5.pdf
(more technical, but less breadth)
I'm a big proponent of these, as they really do solve a couple of fundamental problems (disclaimer: I
used to work for Pivotal, which helped push this solution). The neat thing they added in the
extensions was an understanding of data locality between TaskTrackers and DataNodes when they reside
on the same physical host in different virtual machines. This means that jobs would get assigned to
TTs within the same "node group", which is nice for a couple of reasons. Most prominently,
it allows you to separate the HDFS and MR services into different VMs while maintaining data
locality. This is good for scaling compute separate from storage, particularly in a multi-tenant
environment. Another cool thing is you can "shut off" the execution environment: spin
down the VMs with the TTs but leave the DNs alone. There are some other things they did to make
this architecture make more sense.
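For a flavor of how that node-group awareness is expressed, here is a hedged sketch of a Hadoop topology script extended with a node-group layer. Hadoop really does resolve node locations via a script configured as net.topology.script.file.name, but the hostname scheme (vmNN-hostMM) and the rack math below are invented for illustration; the VMware extensions add the extra layer so that a TT and DN in sibling VMs on one physical host count as local.

```python
# Sketch of a Hadoop topology script with a node-group layer: VMs on the
# same physical host map to the same /rack/host path, so the scheduler can
# keep tasks next to their data. Hostname format here is hypothetical.

import re
import sys

def node_group_path(hostname):
    """Map a VM hostname like 'vm02-host12' to its /rack/nodegroup path."""
    m = re.match(r"vm\d+-host(\d+)", hostname)
    if not m:
        return "/default-rack/default-group"
    host = int(m.group(1))
    rack = host // 10  # assume ten physical hosts per rack (made up)
    return f"/rack{rack}/host{host}"

if __name__ == "__main__":
    # Hadoop invokes the script with one or more names; print one path each.
    for name in sys.argv[1:]:
        print(node_group_path(name))
```

Two VMs on the same physical host resolve to the same node group, which is what lets the scheduler treat a job in one VM as local to a DataNode in its sibling VM.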
So getting back to your question, hypothetically, you could have multiple HDFS
instances on the same cluster (neat), each supporting one or more Accumulo
instances, each of which can be handled independently of one another. Your MR
and other services can also run in VMs, giving you pretty good compartmentalization
of resource utilization. This would give you multi-tenancy and would allow you to
manage separate services running over HDFS as separate clusters. You could also
stop tablet servers while keeping HDFS (and perhaps MapReduce) alive, which
could be interesting if you want to start up a proof of concept but don't need
the service to be live all the time.
In that VMware paper they mention that performance actually increases with this
DN/TT separation scheme over bare metal, but be wary of the numbers. There is
no doubt overhead in having a virtualization layer. But if multi-tenancy and
elasticity are important to you, this could be one way to make that tradeoff.
-Don
On Tue, Nov 5, 2013 at 3:31 PM, Josh Elser
<[email protected]<mailto:[email protected]>> wrote:
Hi Kesten,
As you likely know (given your arguments against), applying virtualization to a Hadoop stack
can introduce some unintended consequences. Hadoop has a lot of heartbeats between
processes to determine system "aliveness". If your infrastructure is
overloaded, Hadoop can really suffer from spikes in latency.
Accumulo is much the same way, arguably more so. Accumulo's processes are
very dependent on maintaining a lock in ZooKeeper (heartbeated every 30 seconds
by default) rather than on RPC calls between DataNodes and the NameNode. Accumulo's node failure
tends to be much more expensive than HDFS' because Accumulo wants to make sure
every tablet is available without significant downtime. Hadoop has multiple
replicas of each file, so it can be a bit lazier about noticing failure and
re-replicating. What I've typically heard is that running Accumulo in a
virtualized environment makes administration and use a bit more difficult.
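A toy illustration (not Accumulo code) of why overload-induced latency spikes are so costly here: a tablet server holds an ephemeral lock in ZooKeeper and must heartbeat within the session timeout. If a stalled VM delays heartbeats past the timeout, ZooKeeper expires the session, the lock vanishes, and the server is declared dead even though its data is fine. The 30-second figure below just echoes the default mentioned above.

```python
# Toy model of ZooKeeper session expiry: the ephemeral lock survives only
# if every gap between heartbeats stays within the session timeout.

def lock_survives(heartbeat_times, session_timeout=30.0):
    """Return True if no gap between consecutive heartbeats exceeds the timeout."""
    for prev, cur in zip(heartbeat_times, heartbeat_times[1:]):
        if cur - prev > session_timeout:
            return False  # session expired; the ephemeral lock is gone
    return True

# Steady heartbeats keep the lock...
print(lock_survives([0, 10, 20, 30, 40]))  # True
# ...but one 45-second stall (say, an overloaded hypervisor) loses it.
print(lock_survives([0, 10, 55, 65]))      # False
```

Losing the lock triggers the expensive recovery Josh describes: Accumulo reassigns every tablet the "dead" server held, which is far more work than HDFS lazily re-replicating blocks.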
If you're considering running HDFS on bare metal, I would encourage you to do
the same with Accumulo, or investigate something like YARN (really, HOYA
https://github.com/hortonworks/hoya/) to do dynamic provisioning. Accumulo has
the ability to happily scale and run across many nodes, so you shouldn't have
to worry about large installation problems (in other words: one Accumulo
instance should be sufficient for a cluster). YARN/HOYA gives you dynamic
allocation on top of your cluster, with the ease of spinning Accumulo clusters
up and down as you want/need them.
On 11/5/13, 3:21 PM, Kesten Broughton wrote:
I've seen arguments both for and against virtualizing Hadoop/HDFS.
(The arguments for were from VMware. :)
We are considering HDFS on bare metal, with Accumulo being virtualized.
This would serve a fairly constant amount of data but widely varying compute
demands.
Has anyone tried this? Can anyone share their experience with
baremetal/virtualization with accumulo?
thanks
kesten
(first post)