Thu, 22 January 2015, 14:24

Problem with booting SmartOS

We recently had a problem booting SmartOS.


Valentin Zaretsky described the following issue: 

« SmartOS hung strangely: SmartOS itself, the native VMs and the KVM guests continued to respond to ping on their IPs, but nothing else worked.

After a hardware restart I cannot log in to the system: after accepting the root password it waits for something and does not show a shell prompt. The VMs are not running. The network interface does come up, though, and ssh prints the banner «SSH-2.0-Sun_SSH_1.5», then hangs after getting the password from the user, the same way as on the console. On the client, ssh -v stops at the following:

debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<3072<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP

When I boot with noimport=true, I am able to log in with the default password and run zpool import zones, and the pool seems to be in a normal, healthy state. The system is rather old (20131128T230213Z), but it ran without problems the whole time, so I did not upgrade it. »
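
The manual import described in this report can be sketched as a small shell helper. This is only an illustration under stated assumptions: the `run` wrapper, the `DRY_RUN` variable and the `import_and_check` function are hypothetical names introduced here, and by default the commands are printed rather than executed. The pool name `zones` comes from the report itself.

```shell
#!/bin/sh
# Sketch: after booting with noimport=true, import the pool by hand
# and look at its health. DRY_RUN, run() and import_and_check() are
# illustrative helpers, not part of SmartOS.
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

import_and_check() {
    pool="${1:-zones}"
    run zpool import "$pool" &&
    run zpool status -x "$pool" &&   # short verdict: healthy or not
    run zpool status -v "$pool"      # full device tree and error counters
}
```

Calling `import_and_check zones` with `DRY_RUN=0` set would run the three commands for real. A pool that reports as healthy here does not rule out a disk or HBA problem that only shows up under load.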


Keith Wesolowski gave us the following advice:

« Most, but not all, instances like this, where the system seems ok until you try to actually log in or do something with it, are actually caused by problems in the disk subsystem. These problems may be transient or persistent, and they may be caused by software bugs or by hardware or firmware issues; the latter are more common.

When you boot with noimport and then import, can you subsequently enable all services and then ssh in? What does fmadm faulty show you? If nothing, are there errors occurring that are precursors to fault diagnosis? You can find that out via fmdump -e. Anything in the logs? (You'll need to import the pool first to read them, which is also the case with the FMA data.)

Failing all of that, I would recommend booting with -m milestone=none. You should be able to log in using the *platform* default root password (which is not the same as the one you set at setup time). From there, you should be able to set up DTrace probes to monitor the progress of startup, then do 'svcadm milestone all' to start all the services. DO NOT LOG OUT OF THE CONSOLE! You will need it to monitor and debug the problem.

If all services (except of course console-login) seem to come up normally, you can then use your favourite tools (DTrace, truss, mdb, etc.) to debug the sshd server when you try to log in. You'll likely need to iterate a few times to narrow your search for the problem as your understanding improves. This is a naive brute-force approach to debugging that almost always yields progress of some kind, even if it's negative progress.

If you can't learn anything at all this way, a last-ditch option (which likely won't work if the problem is with the disks or HBA) is to generate an NMI, which will cause the system to panic and create a crash dump. If you then boot and import the pool, you should be able to run savecore to grab the dump, which can then be analysed to better understand why things were hanging.

How to generate an NMI is hardware-specific, and most desktop or consumer-type systems don't support it. Among those that do, the most common way is to issue the IPMI 'chassis power diag' command remotely using ipmitool. We ship this tool, and it's widely available on all POSIX-type OSs. If your system doesn't have a BMC, or that doesn't work, consult your vendor-supplied docs. »
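
The checks Keith lists can be collected into a rough shell sketch. All function names, the `run` wrapper, the `DRY_RUN` variable and the log path `/tmp/startup.log` are assumptions made for illustration; the commands themselves (fmadm, fmdump, svcadm, svcs, dtrace, truss) are the standard tools named in the advice or their usual companions. By default the helper only prints what it would run.

```shell
#!/bin/sh
# Sketch of the suggested diagnostic sequence. Helper names and the
# log path are illustrative; commands are printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# After importing the pool: diagnosed faults first, then the raw
# error telemetry that precedes a diagnosis.
check_fma() {
    run fmadm faulty
    run fmdump -e
}

# Log every successful exec while services start. Intended to be
# started in the background before start_services.
watch_startup() {
    run dtrace -q -o /tmp/startup.log -n \
        'proc:::exec-success { trace(curpsinfo->pr_psargs); }'
}

# From a console booted with "-m milestone=none": bring up all
# services, then list the ones that are not healthy.
start_services() {
    run svcadm milestone all
    run svcs -xv
}

# Count the system calls sshd makes during a login attempt.
trace_sshd_syscalls() {
    run dtrace -n 'syscall:::entry /execname == "sshd"/ { @[probefunc] = count(); }'
}

# Follow one sshd process (and its children) with truss.
trace_sshd_pid() {
    run truss -f -p "$1"
}
```

The split into small functions reflects the iterative approach in the advice: each step can be rerun on its own from the console as the picture of the problem improves.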

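The last-ditch NMI route might look roughly like the sketch below. The function names, the `run` wrapper and `DRY_RUN` are hypothetical, the BMC host and user are placeholders supplied by the caller, and `/var/crash/volatile` is assumed here as the crash directory; `dumpadm` shows the directory actually configured on a given system. The `-E` option makes ipmitool read the password from the `IPMI_PASSWORD` environment variable.

```shell
#!/bin/sh
# Sketch: trigger an NMI through the BMC, then collect the crash dump
# after the next boot. Names are illustrative; commands are printed
# unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Run from another machine that can reach the BMC over the network.
send_nmi() {
    if [ "$#" -ne 2 ]; then
        echo "usage: send_nmi <bmc-host> <bmc-user>" >&2
        return 2
    fi
    run ipmitool -I lanplus -H "$1" -U "$2" -E chassis power diag
}

# Run on the rebooted host once the pool has been imported.
collect_dump() {
    dir="${1:-/var/crash/volatile}"   # assumed default, check dumpadm
    run dumpadm
    run savecore -v "$dir"
}
```

For example, `send_nmi bmc.example.com admin` prints the ipmitool command line for a placeholder BMC. The saved dump can then be opened with mdb for analysis.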

We have yet to check everything he advised, but we already know far more about the SmartOS boot process than we did before.