Thank you so much, Keith! I have yet to check everything you advised, but I already know many more interesting things about the SmartOS boot process than I ever did.

Best regards,
Valentin Zaretsky

On Sun, Jul 20, 2014 at 2:53 AM, Keith Wesolowski <[email protected]> wrote:

> On Sat, Jul 19, 2014 at 11:19:03PM +0300, Valentine Zaretsky via
> smartos-discuss wrote:
>
> > SmartOS hang strangely: smartos itself, native VM's and KVM's continued
> > responding to ping on their IP's but nothing else worked.
> >
> > After hardware restart I cannot login to system: after getting root
> > password it waits for something and does not show shell prompt. VM's are
> > not running. But network interface comes up, ssh prints banner
> > "SSH-2.0-Sun_SSH_1.5" and the same way as on console hangs after getting
> > password from user.
> >
> > on client ssh -v stops on the following:
> >
> > debug1: kex: server->client aes128-ctr hmac-md5 none
> > debug1: kex: client->server aes128-ctr hmac-md5 none
> > debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<3072<8192) sent
> > debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
> >
> > When I boot with noimport=true, I'm able to login with default password and
> > able to do zpool import zones. and pool seems to be in normal healthy status
>
> Most, but not all, instances like this where the system seems ok until
> you try to actually log in or do something with it are actually caused
> by problems in the disk subsystem. These problems may be transient or
> persistent, and they may be caused by software bugs or by hardware or
> firmware issues; the latter are more common. When you boot with
> noimport and then import, can you subsequently enable all services and
> then ssh in? What does fmadm faulty show you? If nothing, are there
> errors occurring that are precursors to fault diagnosis? You can find
> that out via fmdump -e. Anything in the logs (you'll need to import the
> pool first to read them, which is also the case with the FMA data).
>
> Failing all of that, I would recommend booting with -m milestone=none.
> You should be able to log in using the *platform* default root password
> (which is not the same as the one you set at setup time). From there,
> you should be able to set up DTrace probes to monitor the progress of
> startup, then do 'svcadm milestone all' to start all the services. DO
> NOT LOG OUT OF THE CONSOLE! You will need it to monitor and debug the
> problem. If all services (except of course console-login) seem to come
> up normally, you can then use your favourite tools -- DTrace, truss,
> mdb, etc. -- to debug the sshd server when you try to log in. You'll
> likely need to iterate a few times to narrow your search for the problem
> as your understanding improves.
>
> This is a naive brute-force approach to debugging that almost always
> yields progress of some kind, even if it's negative progress. If you
> can't learn anything at all this way, a last-ditch option (which likely
> won't work if the problem is with the disks or HBA) is to generate an
> NMI, which will cause the system to panic and create a crash dump. If
> you then boot and import the pool, you should be able to run savecore to
> grab the dump, which can then be analysed to better understand why
> things were hanging. How to generate an NMI is hardware-specific, and
> most desktop or consumer-type systems don't support it. Among those
> that do, the most common way is to issue the IPMI 'chassis power diag'
> command remotely using ipmitool. We ship this tool, and it's widely
> available on all POSIX-type OSs. If your system doesn't have a BMC,
> or that doesn't work, consult your vendor-supplied docs.
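
For reference, the commands named in this thread can be collected into a small checklist script. This is only a sketch under stated assumptions: the pool name 'zones' is taken from the original report, the run helper and the DRY_RUN variable are hypothetical names introduced here for illustration, and by default nothing is executed (each command is only printed). Remote use of ipmitool also needs BMC host and credential options, which are deliberately left out.

```shell
#!/bin/sh
# Checklist sketch of the steps discussed in this thread.
# DRY_RUN and run() are hypothetical helpers, not part of SmartOS:
# with DRY_RUN=1 (the default) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# After booting with noimport=true: import the pool, then inspect FMA.
check_faults() {
    run zpool import zones   # pool name assumed from the original report
    run fmadm faulty         # diagnosed faults, if any
    run fmdump -e            # error reports that precede a diagnosis
}

# After booting with -m milestone=none: start all services from the console.
start_services() {
    run svcadm milestone all
}

# Last resort: ask the BMC to raise an NMI so the system panics and dumps.
# Host and credential options for remote use are omitted here.
force_dump() {
    run ipmitool chassis power diag
}

# After the next boot: import the pool and save the crash dump.
collect_dump() {
    run zpool import zones
    run savecore
}
```

Sourcing the file and calling check_faults prints the three commands in order; setting DRY_RUN=0 on the affected machine runs them for real, ideally from the console session that stays logged in.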
