XenFolk,

I was given the task of bringing up some useful virtual servers under
OpenSolaris on a SunFire X2200, as a demonstration.  After moving it to
a colo, I find that it crashes on occasion for no discernible reason.
And the remote console doesn't seem to work remotely!  I need help or
suggestions for debugging.

Details:

I brought up the system in my basement, and was able to access serial
console, video console, eLOM SSH, eLOM Web, and Web remote console over
my home network.  Attempts to create a RAID drive on it showed that it
didn't really have hardware RAID, only assistance for software RAID.
After installing from the OpenSolaris CD I had downloaded, I realized that it
was a minimal distribution.  I then downloaded SXCE build 107 and
installed it, using ZFS to mirror the two 1 TB drives.  I
re-booted it as a Solaris xVM dom0 and installed fully virtualized copies of
SXCE, Fedora Linux 10, and Ubuntu Linux 8.10 server.  Somewhere along the
line I fixed a buggy script, /usr/lib/xen/scripts/vbd-check, by adding all the
quotes that it needs [vs. a limited number in the official fix that I
later found].  It seemed to work fine.

I disconnected my VT200, monitor, keyboard, mouse, etc. and took it to a
colo, where it was installed with network connections only to the eLOM
port and the system network port.  I verified that there was SSH
access to all virtual machines and the eLOM, and left.

After a few days, I found problems browsing to or SSH'ing in to the
virtual systems.  I believe that, the first time, neither Linux VM
responded to Web or SSH, and the SXCE VM was slow to respond to SSH.
When I logged into the dom0 SXCE, I found that various commands would
not run, even though an 'ls' showed them in their "bin" directories, and
testing showed that I could not get to the contents of those or other
files.  I could still SSH to the eLOM port, and I could also browse to
the SSL Web port for the eLOM, so I could re-boot the system.  It came
back up with all four installed OSes (one dom0 and three domU) working
fine.  Since then, this has happened at intervals ranging from one day to
several weeks [I was hoping that magic had struck].  Unlike that
first time, when it happens I usually cannot log in to the systems at
all.  I am not constantly monitoring them, so it is likely that
the first time I chanced on a degraded mode on its way to crashing.

About the SXCE domU that was slow to respond - on that one, I NFS
auto-mount my home directory from the dom0 to the domU.  I was going to
do that for the Linux domUs as well, but (a) it wasn't working, and (b)
they wanted different contents in the user home directories.  Anyway,
after each re-boot the NFS auto-mount takes much longer than I think it
should.  This may or may not be an actual bug that needs fixing.
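
To put a number on "much longer than I think it should", something like
the following could be run on the domU right after a re-boot.  It is only
a sketch: the function name time_first_access is made up for
illustration, and the path passed to it is assumed to be the
auto-mounted home directory.

```python
import os
import time


def time_first_access(path):
    """Time one directory listing of `path`.

    On an auto-mounted path, the first listing after boot includes the
    time the automounter takes to perform the NFS mount.
    Returns (elapsed_seconds, number_of_entries).
    """
    start = time.monotonic()
    entries = os.listdir(path)  # first access triggers the auto-mount
    elapsed = time.monotonic() - start
    return elapsed, len(entries)
```

Calling it twice in a row on the same path and comparing the two times
would separate the mount delay from ordinary NFS listing time.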

As far as what appears on the console, I wish I could tell you.  The
remote console worked fine when I was on the same network.  Over the
Internet, SSH and the HTTPS Web interface work, but when I
invoke the remote console from the same machines that worked locally,
the Java runtime loads javaRKVM.jnlp, grinds for a while, asks about the
certificates [twice], and then spits out an "IOException / Create
Connection Failure!" window, and the Sun eLOM Remote Console window just
sits there blank.  If it matters, my network is NATted from the world;
but the documentation does not say anything about this or other
potential problems.
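
Since SSH and HTTPS get through but the remote console does not, one
possible check is to probe from outside which TCP ports on the eLOM
accept connections at all.  The following is a minimal sketch, not a
diagnosis: probe_ports is a made-up helper, and the host and port list
are whatever the caller supplies.  I do not know which ports the remote
console actually uses, so those would have to come from the eLOM
documentation or a packet capture on the local network.

```python
import socket


def probe_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port; return {port: True or False}.

    True means the connection was accepted.  False covers refused,
    timed-out, and otherwise failed connections alike.
    """
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Running probe_ports with the eLOM address and ports 22 and 443, plus any
candidate console ports, once from the home network and once from the
colo network, would show whether something in between is dropping the
console connections.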

Has anyone experienced these problems before?  Does anyone have a fix
for them?  Does anyone have suggestions as to how they may be debugged?
Thanks!


-- 
/*********************************************************************\
**
** Joe Yao                              [email protected] - Joseph S. D. Yao
**
\*********************************************************************/
_______________________________________________
xen-discuss mailing list
[email protected]
