Well, I am probably the only one outside of "the labs" who stuck it
out and had an
xcpu cluster running users' jobs for several months.  I am very sad
about its demise...

To me, the two big missing pieces were a scheduler and MPI.  Even though
mvapich was kind of working, it never really got debugged enough.  And the
bjs port remained buggy too.  If those two had worked properly, the cluster
would still be running xcpu.
I am not managing it anymore, so it has gone to caoslinux with torque/maui.
I still have a couple of bproc clusters running...

Now we manage a 4,000-node cluster using Moab, xCAT, and
diskless/stateless provisioning, but with
a "real" OS image on every node.  It works, even if it is ugly.  Enough said...

Daniel

On Tue, Nov 24, 2009 at 3:43 PM, Andrew Shewmaker <[email protected]> wrote:
>
> On Tue, Nov 24, 2009 at 1:11 PM, Eric Van Hensbergen <[email protected]> wrote:
>> Those are implementation specifics that the user/admin can be largely
>> unaware of.  It would be quite trivial to assume the same environment
>> as .ssh (system-level password authentication and/or key-files on a
>> shared file system).
>>
>> The unfortunate side of that is it requires shared distributed file
>> system or shared auth mechanisms be present which mean you require
>> something more than the drone systems we currently deploy with xcpu2
>> which are much easier to manage.
>
> We don't necessarily use a shared distributed file system for things
> like system keys.  Since they don't change often, we may put them into
> a RAM root image and perhaps update them with a tree'd remote copy.
>
> I want to clarify what I said before, since I combined authentication
> and account authorization.  In addition to something like ssh key
> authentication, resource managers like torque use PAM to determine
> which accounts are active on a node at a given time.
>
> Now, I'm not particularly fond of any of the existing resource
> managers, so I would be content if a scheduler (Moab in our case)
> talked directly to xcpu2.  We also need tight integration with MPI
> implementations.  Currently we have a situation where the resource
> manager has to establish connections to all of the nodes in an
> allocation, then MPI has to do the same sort of wireup.  I understand
> that it is non-trivial to get Open MPI to utilize xcpu.
>
> --
> Andrew Shewmaker
>
