Hello,

I am experimenting with UML in an HPC cluster. Basically, I start up 60 
instances all at once, several on each hardware node, using the resource 
manager TORQUE. Each instance gets a different umid. The instances are 
configured to boot, execute a job, and then halt. Most of the time this 
works very well. However, every now and then one of the 60 instances 
gets stuck at boot with the infamous "INIT: Id 0 respawning too fast" 
message and consequently neither runs the job nor terminates.
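
To spot the affected instances automatically, something like the following could be used. This is only a minimal sketch: it assumes the console output of each instance is captured in a file named after its umid, e.g. logs/<umid>.log. That layout, the directory name and the function name are hypothetical and not part of my actual setup.

```python
from pathlib import Path

# Substring of the message printed by init when an inittab entry keeps failing.
MARKER = "respawning too fast"


def find_stuck_instances(log_dir):
    """Return the sorted umids whose console log contains the INIT message.

    Assumes one log file per instance, named <umid>.log (hypothetical layout).
    """
    stuck = []
    for log_file in Path(log_dir).glob("*.log"):
        # Console logs may contain stray bytes, so decode leniently.
        text = log_file.read_text(errors="replace")
        if MARKER in text:
            stuck.append(log_file.stem)
    return sorted(stuck)
```

Calling find_stuck_instances("logs") after a batch run would list the umids that hit the message, which might help to check whether the failures cluster on particular hardware nodes.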

So far I have found mentions of two possible causes for this problem: 
1) a wrong tty device name in inittab, and 2) the /lib/tls problem. 
Neither applies in my case (/dev/tty0 is correct, and I have already 
renamed /lib/tls, just in case).

As I can reproduce the problem "statistically" (quite reliably in the 
cluster context) but not at will when running a single instance from the 
command line, my question is: how should I go about troubleshooting it? 
Are there any locations in the UML kernel code where I could insert some 
debug statements (or maybe delays, in case the problem is timing-related) 
to gather useful diagnostic information?

Best regards -
Jan Ploski

--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
Betriebliches Informationsmanagement
Escherweg 2  - 26121 Oldenburg - Germany
Fon: +49 441 9722 - 184 Fax: +49 441 9722 - 202
E-Mail: [EMAIL PROTECTED] - URL: http://www.offis.de
