First off, I'd like to thank you very much for your interest and involvement in 
Tashi.  I've tried to respond to the specific issues listed:

Priority 1: Authentication/Encryption

I agree this is a high priority item.  A student working over the summer 
(Michael Wang) modified Tashi to use RPyC, which provides a user authentication 
mechanism as well as a secure channel for requests.  We haven't done extensive 
testing, but it appears to provide most of what we want.  It requires some 
manual configuration at this point, but before I dig deeper, I'd like to know 
whether this approach is for some reason unsatisfactory for you in general.

Priority 2: Network configuration

I agree that this will likely be an ongoing issue.  In our current 
infrastructure, we have a DMZ (with 10 public IPs), a general network, and 
several private VLANs.  We have assumed control of the DMZ and the general 
network, but are having users run their own DNS and DHCP servers in the private 
VLANs.  I agree completely with the strategy you suggest -- implement what we 
need now with an eye toward future extensions.

Priority 3: Site-specific plugins

This is similar to the last point in that we need to implement what we need now 
while trying to keep it extensible, but we won't really know all the 
requirements until more sites are using Tashi.

Priority 4: VM scheduling model

The basic scheduler (primitive.py) doesn't do much in this space.  We have, 
however, implemented a bridge that allows the use of Maui, a resource 
scheduler, to control VM creation.  This should allow the use of more advanced 
scheduling techniques for things like priorities and quotas.  A basic system of 
billing would be possible by using this as well, but it would seem advantageous 
to have Tashi support a more direct and systematic form of billing.
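As a rough illustration of the kind of quota/priority admission check that a scheduler bridge like this could enforce, here is a toy sketch (all class and field names are hypothetical, not part of Tashi or Maui):

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Per-user resource limits (illustrative only)."""
    max_vms: int
    max_cores: int


class QuotaScheduler:
    """Toy admission check of the sort a Maui-style bridge could apply
    before allowing a VM creation request to proceed."""

    def __init__(self, quotas):
        self.quotas = quotas   # user -> Quota
        self.usage = {}        # user -> (vms in use, cores in use)

    def admit(self, user, cores):
        """Return True and record usage if the request fits the quota."""
        vms, used_cores = self.usage.get(user, (0, 0))
        q = self.quotas[user]
        if vms + 1 > q.max_vms or used_cores + cores > q.max_cores:
            return False
        self.usage[user] = (vms + 1, used_cores + cores)
        return True
```

The same usage ledger could feed a basic billing system, since it already tracks per-user resource consumption at admission time.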

Priority 5: Physical boot

We have looked at this a fair bit and have drawn two basic conclusions.  One is 
that if we properly isolate physical machines (using VLANs, routing, and other 
techniques), we can keep a rogue DHCP server from affecting the entire cluster 
and limit its impact to a private VLAN (presumably owned and managed by one 
user or group).  We are working with others at HP on a project called PRS that 
is responsible for the physical booting.  It will automatically reprogram 
switches and other networking infrastructure to limit the access of an 
end-host and set up servers to perform the PXE booting.  The other conclusion 
is that, in general, current hardware lacks the ability to limit modifications 
to the BIOS and other system hardware by a privileged user in the operating 
system.

We have thought of dealing with these problems by, as mentioned above, limiting 
the impact using network isolation, and by disincentivizing the latter 
behavior with a billing system that continues to bill a user until a machine is 
returned (i.e., it PXE boots a base image we provide).  And as you mention, 
this feature is just beginning to materialize.

Priority 6: Multi-VM job control

This may be solvable by using Maui as the scheduler, but I agree that this is a 
scheduler-only change and shouldn't be tremendously difficult with respect to 
Tashi (synchronized operations are always a little challenging in a cluster).
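For concreteness, the transactional group-start behavior could be sketched as follows (a toy capacity model with illustrative names, not Tashi's actual scheduler API): either every VM in the group is placed, or none are.

```python
class Cluster:
    """Toy capacity model demonstrating an all-or-nothing group start."""

    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)   # host -> free VM slots

    def start_group(self, group_size):
        """Place all VMs in the group, or roll back and place none.

        Returns the list of chosen hosts, or None if the group does
        not fit in the cluster as a whole.
        """
        snapshot = dict(self.free_slots)     # for rollback on failure
        placement = []
        for _ in range(group_size):
            # Greedy choice: the host with the most free slots.
            host = max(self.free_slots, key=self.free_slots.get)
            if self.free_slots[host] == 0:
                self.free_slots = snapshot   # tear down partial group
                return None
            self.free_slots[host] -= 1
            placement.append(host)
        return placement
```

The rollback-on-failure step is the essence of the transactional mechanism: a real implementation would also tear down any VMs already activated, but the bookkeeping shape is the same.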

To respond to your question about joining and proposing and developing 
solutions, I'd like to warmly welcome you to do so.  I have sent this email to 
the tashi-dev mailing list and BCC'd all of the original recipients (to avoid 
exposing email addresses).  I'd be happy to continue any discussion on the 
mailing list.  You can join the mailing list by emailing 
[email protected].  Additionally, if you have code, 
patches, ideas, or documentation to contribute, sending it to the list is the 
right way to get it applied to SVN.  The basic way forward is for us to 
continue this discussion by exchanging ideas and code.  Assuming you want to get 
even more involved, we could look into making one or more of you committers 
after some further interactions.

In terms of testing, I haven't written much documentation.  The procedure works 
roughly as follows:

1. Install on a small testbed (2-3 nodes) and test all basic features as well 
as any new functionality.
2. If the change affects the cluster manager, stop the scheduler, backup the 
CM's data, update the software and restart the CM and scheduler on the 
production cluster.
3. Incrementally update the software on the nodes, simply killing the node 
manager process and restarting it (everything should automatically reload).  
Again, this is on our production cluster.

Obviously, in cases where the data format checkpointed by the node manager has 
changed, the checkpoint must be updated between the exit and the restart.
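Steps 2 and 3 might be scripted roughly as follows.  This is only a sketch: the checkpoint path and the restart command are assumptions for illustration, not actual Tashi tooling.

```shell
#!/bin/sh
# Sketch of the rolling-upgrade procedure described above (steps 2-3).
# CM_DATA and the per-host restart command are hypothetical placeholders.
set -e

CM_DATA=${CM_DATA:-/var/tmp/tashi-cm-data}   # hypothetical CM checkpoint path

# Step 2: before updating the cluster manager, back up its checkpointed data.
backup_cm_data() {
    cp -r "$CM_DATA" "$CM_DATA.bak-$(date +%Y%m%d)"
}

# Step 3: incrementally restart the node manager on each host; the node
# manager is expected to reload its state automatically on restart.
rolling_restart() {
    for host in "$@"; do
        echo "restarting node manager on $host"
        # ssh "$host" 'pkill -f nodemanager && start-nodemanager'  # illustrative
    done
}
```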

Again, thank you very much for your time and energy.  I appreciate the detailed 
analysis of the current system and look forward to working with you in the 
future.

- Michael

-----Original Message-----
From: Sheen, Robert 
Sent: Thursday, September 17, 2009 6:02 AM
To: Ryan, Michael P
Subject: RE: Support of Tashi

Dear Ryan,

        This is Robert Sheen at Taiwan HP. I would like to ask for your support 
in helping III resolve their questions about Tashi. III is planning to join 
Open Cirrus and has already installed Tashi on their site. Your help will 
greatly speed up the collaboration; thanks in advance.

        Dr. Hsieh of III is on the cc list. After studying the Tashi slides, he 
compiled the known issues listed below. Dr. Hsieh would like to know the 
current status of these known issues, and, if III wants to join in proposing 
and developing solutions for them, how to proceed and what procedure to 
follow. Thanks!

        The second question: III is drafting a test plan for the Tashi 
environment. Mr. Chen would like to ask whether there is any existing test 
procedure document to reference. Thanks!


•    Priority 1: Authentication/Encryption

–    Virtual cluster owner authentication has not been resolved in the current 
Tashi implementation

–    Plan: select a user account management scheme soon and implement (probably 
via SSL)

•    Priority 2: Network configuration

–    Site-specific network configuration will probably be an on-going thorny 
issue.  How many global IP addresses are available?  Which private subnets are 
available?  Do the physical cluster owners have control over local DHCP/DNS 
servers? Etc.

–    Plan: implement something that works for the first few Tashi sites, 
architect the site-specific plugin to enable modification, adapt as new needs 
surface

•    Priority 3: Site-specific plugins 

–    Are agents capable of doing all of the site-specific logic needed to 
create and manage VMs?

–    Plan: Solicit feedback from partners to determine for which steps in VM 
creation/activation customization is critical

•    Priority 4: VM scheduling model

–    Tashi does not currently have a well-integrated scheduler that supports VM 
priorities, quotas, billing, etc.

–    Plan: Implement features on “as needed” basis

•    Priority 5: Physical boot

–    A number of security concerns have surfaced here if the owner of the 
physically-booted machine is not completely trusted (or if a trusted, but 
naïve, owner’s machine becomes compromised). What if a DHCP server is started 
that competes with the cluster’s server?  If we rely on PXE boot to regain 
control, can we prevent a physical owner from reprogramming the BIOS to prevent 
PXE boot?  What are the best monitoring/control options? Etc.

–    Plan: do not offer physical boot in Tashi until security model is better 
understood

•    Priority 6: Multi-VM job control

–    The current scheduling agent activates VMs one at a time.  A transactional 
mechanism needs to be added that only starts a VM group if there is room to 
accommodate the entire group and enables easy tear-down if any portion of the 
group fails

–    Plan: Extend scheduler with such a feature, should be straight-forward

 

Best Regards,
Robert Sheen
沈 仲 杰
HP TSG Pre-Sales
Solution Manager
