On Sat, Jul 19, 2014 at 11:19:03PM +0300, Valentine Zaretsky via 
smartos-discuss wrote:

> SmartOS hang strangely: smartos itself, native VM's and KVM's continued
> responding to ping on their IP's but nothing else worked.
> 
> After hardware restart I cannot login to system: after getting root
> password it waits for something and does not show shell prompt. VM's are
> not running. But network interface comes up, ssh prints banner
> "SSH-2.0-Sun_SSH_1.5" and the same way as on console hangs after getting
> password from user.
> 
> on client ssh -v stops on the following:
> 
> debug1: kex: server->client aes128-ctr hmac-md5 none
> debug1: kex: client->server aes128-ctr hmac-md5 none
> debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<3072<8192) sent
> debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
> 
> 
> When I boot with noimport=true, I'm able to login with default password and
> able to do zpool import zones. and pool seems to be in normal healthy status

Most, but not all, instances like this, where the system seems ok until
you try to log in or do something with it, are caused by problems in
the disk subsystem.  These problems may be transient or
persistent, and they may be caused by software bugs or by hardware or
firmware issues; the latter are more common.  When you boot with
noimport and then import, can you subsequently enable all services and
then ssh in?  What does fmadm faulty show you?  If nothing, are there
errors occurring that are precursors to fault diagnosis?  You can find
that out via fmdump -e.  Is there anything in the logs?  (You'll need to
import the pool first to read them, which is also the case with the FMA
data.)
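
As a rough sketch of those checks, something like the following could be
run after booting with noimport=true.  The function name is made up for
illustration, the pool name "zones" is taken from your report, and the
zpool status and iostat calls are my additions to the commands named
above; adjust to taste.

```shell
#!/bin/sh
# Hypothetical helper: gather disk-subsystem evidence in one pass.
# Assumes the pool is called "zones" unless another name is given.
collect_disk_evidence() {
    pool="${1:-zones}"
    zpool import "$pool"      # logs and FMA data live on the pool
    zpool status -v "$pool"   # per-device read/write/checksum errors
    fmadm faulty              # faults FMA has already diagnosed
    fmdump -e                 # error reports not (yet) diagnosed
    fmdump -eV | tail -100    # full detail of the most recent ereports
    iostat -En                # per-device soft/hard/transport errors
}
```

Calling collect_disk_evidence with no argument uses "zones".  An empty
fmadm faulty alongside a steadily growing fmdump -e list would suggest
errors that have not yet crossed a diagnosis threshold.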

Failing all of that, I would recommend booting with -m milestone=none.
You should be able to log in using the *platform* default root password
(which is not the same as the one you set at setup time).  From there,
you should be able to set up DTrace probes to monitor the progress of
startup, then do 'svcadm milestone all' to start all the services.  DO
NOT LOG OUT OF THE CONSOLE!  You will need it to monitor and debug the
problem.  If all services (except of course console-login) seem to come
up normally, you can then use your favourite tools -- DTrace, truss,
mdb, etc. -- to debug the sshd server when you try to log in.  You'll
likely need to iterate a few times to narrow your search for the problem
as your understanding improves.
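
One possible shape for that console session is sketched below.  The
function names are invented for illustration, and the exec-success
probe is only one choice of what to watch; you will probably want
different probes once you know where startup stalls.

```shell
#!/bin/sh
# Hypothetical helpers for a console session booted with
# -m milestone=none.  Run as root.

watch_startup() {
    # Print each successful exec with a timestamp, in the background,
    # so the last start method to run before a stall is visible.
    dtrace -q -n 'proc:::exec-success { printf("%Y %d %s\n", walltimestamp, pid, curpsinfo->pr_psargs); }' &
}

start_services() {
    svcadm milestone all   # start all the services
    svcs -xv               # explain any service that is not running
}

trace_sshd() {
    # Follow the oldest sshd (the listener) and its children while a
    # login is attempted from another machine.
    truss -f -p "$(pgrep -o -x sshd)"
}
```

A plausible order is watch_startup, then start_services, then
trace_sshd while you try to ssh in from the client.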

This is a naive brute-force approach to debugging that almost always
yields progress of some kind, even if it's negative progress.  If you
can't learn anything at all this way, a last-ditch option (which likely
won't work if the problem is with the disks or HBA) is to generate an
NMI, which will cause the system to panic and create a crash dump.  If
you then boot and import the pool, you should be able to run savecore to
grab the dump, which can then be analysed to better understand why
things were hanging.  How to generate an NMI is hardware-specific, and
most desktop or consumer-type systems don't support it.  Among those
that do, the most common way is to issue the IPMI 'chassis power diag'
command remotely using ipmitool.  We ship this tool, and it's widely
available for other POSIX-type OSs.  If your system doesn't have a BMC,
or the command doesn't work, consult your vendor-supplied docs.
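
For illustration only, the NMI and dump collection steps might look
roughly like this.  The function names, the BMC host and the user are
placeholders, and the lanplus interface is an assumption about your
BMC; ipmitool prompts for the password when none is given.

```shell
#!/bin/sh
# Hypothetical helpers; neither the names nor the arguments come from
# your setup.

# Run on a *different* machine that can reach the BMC.
# $1 = BMC hostname or address, $2 = BMC user name.
send_nmi() {
    ipmitool -I lanplus -H "$1" -U "$2" chassis power diag
}

# Run on the affected host after it has rebooted and the pool has been
# imported.
collect_dump() {
    dumpadm        # show the dump device and the savecore directory
    savecore -v    # extract the crash dump into that directory
}
```

For example, send_nmi bmc.example.com admin from your workstation,
then collect_dump on the host once it is back up.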

