Re: [Haifux] Login console freezes: Eli's weekly riddle
On 07/11/10 00:44, Eli Billauer wrote: and I can only hope that those who maintain the software controlling my car's airbag are doing so with a deeper understanding of what each software component stands. Can't speak for the airbag industry. I do and did have clients in the aeronautics industry, though, and I can categorically tell you that things there are much much much better. A typical semi-critical[1] software piece designed to run in a civilian aircarft's cockpit gets allocated a minimum of 9 hours of QA for each hour of development. There are whole companies dedicated to code reviewing and certifying, and there are strict, often draconian, software design and style restrictions. I'm not sure what I feel about such bugs in the Linux kernel, but it[2] does not diminish my safe feeling while flying. Shachar 1 - Semi-critical - the pilot might, indirectly, rely on the information received, but it does not, itself control the aircraft, nor does it even provide flight-critical data. Maps are non-critical (uncertified). Maps showing the aircraft's position on them are semi-critical, and must be certified. 2 - Which is not to say that other things might. -- Shachar Shemesh Lingnu Open Source Consulting Ltd. http://www.lingnu.com ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
On 29/10/10 17:04, Eli Billauer wrote: /* find a device that is not in use. */ printk(KERN_ALERT 34: pty_open to lock\n); tty_lock(); printk(KERN_ALERT 35: pty_open locked\n); snip Set a global variable right before the tty_lock call, and clear it immediately after. Inside tty_lock (and probably tty_unlock too), set up many printks conditional on this global variable being set. Print any relevant identifier you can find (such as the device ID). This should help you find out WHY the device takes so long to lock. and hopefully, who the contention is with. Also, in tty_lock, save to a global variable who is holding the lock, and print that variable from the code above. Shachar -- Shachar Shemesh Lingnu Open Source Consulting Ltd. http://www.lingnu.com ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
Hello Shachar. I suppose great minds think alike. ;) This is more or less what I've already done, and with some correspondence on LKML, I've managed to nail the exact place where the kernel goes to sleep for 30 seconds taking the lock with it. Sorry for not updating you, but my thought was to summarize the issue when the patch is out. So a kernel bug it is indeed. It wasn't the simple thing I thought, though. It turns out, that under some conditions, a close() operation on a serial port may require waiting until all written data is flushed. In these cases, it's normal behavior that returning from the close() call takes up to 30 seconds (the timeout). In particular, when it's modem-manager trying to probe serial ports that don't really exist. Now, work was done on the TTYs' locking schema between kernels 2.6.35 and 2.6.36, and the maintainer overlooked this possibility, and hence had no problem with the lock being held at that crucial point when the data is being drained. As a result, all system calls related to TTYs and PTYs (that is, serial ports and *cough* virtual terminals, and I suppose other keyboard input) were frozen waiting for the big TTY mutex every time modem-manager chose to close a port. Which it does a few times after a boot. The LKML thread is at http://lkml.org/lkml/2010/11/2/314 (for some reason, my postings break the threading every time, even though I reply-to-all). This is not a trivial thing to solve, since it seems like nobody really knows what assumptions have been made on the two muteces involved, so it's not so clear if one can release them just before the possible long sleep. I suppose the guy who manipulated the locks will come up with a fix sooner or later. I've reverted to 2.6.35 anyhow. Ah, and by the way, what started this thing was an effort to stop using the big kernel lock. As a matter of fact, from 2.6.36, the big kernel lock is no longer used in core kernel code. (Hurray...?) So that's the way things stand. I can't say I was very encouraged by this little trip to kernel-land, and I can only hope that those who maintain the software controlling my car's airbag are doing so with a deeper understanding of what each software component stands. Don't tell me. They probably don't. Only they don't discuss their issues over a public mailing list. Eli Shachar Shemesh wrote: On 29/10/10 17:04, Eli Billauer wrote: /* find a device that is not in use. */ printk(KERN_ALERT 34: pty_open to lock\n); tty_lock(); printk(KERN_ALERT 35: pty_open locked\n); snip Set a global variable right before the tty_lock call, and clear it immediately after. Inside tty_lock (and probably tty_unlock too), set up many printks conditional on this global variable being set. Print any relevant identifier you can find (such as the device ID). This should help you find out WHY the device takes so long to lock. and hopefully, who the contention is with. Also, in tty_lock, save to a global variable who is holding the lock, and print that variable from the code above. Shachar -- Shachar Shemesh Lingnu Open Source Consulting Ltd. http://www.lingnu.com -- Web: http://www.billauer.co.il ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
On Sunday, 7 בNovember 2010 00:44:12 Eli Billauer wrote: The LKML thread is at http://lkml.org/lkml/2010/11/2/314 (for some Eli, I always suspected you are a programmer disguised as a EE, now you proved me right ;-) That was a very interesting read. -- Oron Peled Voice: +972-4-8228492 o...@actcom.co.il http://users.actcom.co.il/~oron Ignore Your Rights And They'll Go Away ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
On 28/10/10 17:59, Eli Billauer wrote: Hi, PAM did indeed cross my mind as a suspect. Unfortunately, I have no /var/log/auth* logfiles. As a matter of fact, I don't have any log messages related to authentication. In my very old Redhat 7.3, I had pam_unix messages in /var/log/messages telling me who has changed user ID. On Fedora 12, nothing is mentioned about becoming root in /var/log/messages, and I can't see any directory or file in /var/log that seems to contain that information. Maybe under the audit directory, but the files there are not really human readable. Just some random thought - perchance the two are not independent? Maybe you have a PAM configuration problems that causes both the lack of logs AND the slow login? Also, did you try public key authentication via ssh to see whether that one is faster? Any idea of where my authenticate mechanism keeps its logs? Should probably have been where your authentication mechanism keeps the config files that tell it where to generate the logs. Give a man a fish, and he'll eat for a day. Teach him how to cook, and he'll die starving, dreaming of all the wonderful things he could have cooked, had only he had a fish And why a kernel upgrade would have any effect? Like I said. I suspect the configuration was broken even before the upgrade. rantIn the past, I clearly recall just upgrading my kernel and all was fine. Where are those days?/rant I'm sorry, you mistagged the above. We employ a strict four pages minimum on rants. Soap box addresses between one and four pages should carry the bicker tag. Complaints over two lines long should carry the hidepark tag. This two sentences should have carried the whine tag. whineThere used to be a time when you could expect old timers to read the list's FAQ/whine Shachar P.s. You did say Fedora 12, as in RedHat based distribution, as in tons and tons of patches over the vanilla kernel, right? What kernel did you upgrade to? Was it vanilla? Sh. -- Shachar Shemesh Lingnu Open Source Consulting Ltd. http://www.lingnu.com ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
Hello, Even though it would be fun (??!!) to dig into the authentication mechanism, isn't the fact that failing to open /dev/tty takes 30 seconds enough to blame the kernel alone? Or the same time to open /dev/pts/0? Could there be any explanation for this except for a kernel bug? And yes, I'm using Fedora 12 with a 2.6.36 vanilla kernel. I'm a wild guy. But so was my 2.6.35, which has no particular problems, except for what happens when the disk is accessed heavily. So the freeze thing lies in what happened in the gap between two vanillas. And as a side remark, I've had two null pointer oopses when logging out the GUI console. The caller seems to be selinux_socket_unix_stream_connect(). Oopses. I think I'm going to post a kernel patch which paints the screen background blue. My current standpoint is that I've seen enough to know who to blame, and that leaves me with the possibility to post a kernel bug report (unless someone tells me I'll make a fool of myself doing so) or just wait until a kernel patch emerges. As I logout or boot very rarely, I can live with this issue for a month or two. Please correct me if I'm wrong. Thanks, Eli Shachar Shemesh wrote: On 28/10/10 17:59, Eli Billauer wrote: Hi, PAM did indeed cross my mind as a suspect. Unfortunately, I have no /var/log/auth* logfiles. As a matter of fact, I don't have any log messages related to authentication. In my very old Redhat 7.3, I had pam_unix messages in /var/log/messages telling me who has changed user ID. On Fedora 12, nothing is mentioned about becoming root in /var/log/messages, and I can't see any directory or file in /var/log that seems to contain that information. Maybe under the audit directory, but the files there are not really human readable. Just some random thought - perchance the two are not independent? Maybe you have a PAM configuration problems that causes both the lack of logs AND the slow login? Also, did you try public key authentication via ssh to see whether that one is faster? Any idea of where my authenticate mechanism keeps its logs? Should probably have been where your authentication mechanism keeps the config files that tell it where to generate the logs. Give a man a fish, and he'll eat for a day. Teach him how to cook, and he'll die starving, dreaming of all the wonderful things he could have cooked, had only he had a fish And why a kernel upgrade would have any effect? Like I said. I suspect the configuration was broken even before the upgrade. rantIn the past, I clearly recall just upgrading my kernel and all was fine. Where are those days?/rant I'm sorry, you mistagged the above. We employ a strict four pages minimum on rants. Soap box addresses between one and four pages should carry the bicker tag. Complaints over two lines long should carry the hidepark tag. This two sentences should have carried the whine tag. whineThere used to be a time when you could expect old timers to read the list's FAQ/whine Shachar P.s. You did say Fedora 12, as in RedHat based distribution, as in tons and tons of patches over the vanilla kernel, right? What kernel did you upgrade to? Was it vanilla? Sh. -- Web: http://www.billauer.co.il ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
Maxim Kovgan wrote: 1) I hope I didn't miss a sentence in one of the emails I've searched the google for 48 hours for similar problem... I always check Google first. Just to find out that it's adapting itself to common people. For example, looking for pts, I get hits on the word points. Looks like we're loosing Google too. As for the arguments about TTYs blocking, I don't think that is relevant during the open() call. Opening a TTY/PTY is doing nothing except allocating memory and registering stuff, as far as I know. No reason it should take any time. Unless... Since there's a little kernel hacker hiding in me, I decided to put printk's all over the pty.c file. I ended up with 37 of them, and here's the interesting part (in the modified code): snip static int ptmx_open(struct inode *inode, struct file *filp) { struct tty_struct *tty; int retval; int index; nonseekable_open(inode, filp); /* find a device that is not in use. */ printk(KERN_ALERT 34: pty_open to lock\n); tty_lock(); printk(KERN_ALERT 35: pty_open locked\n); snip OK, so I wrote pty_code where it should be ptmx_open. But the numbers are the real identifiers. And here's a small snippet from my /var/log/messages: Oct 29 16:14:34 ocho kernel: 02: pty_close to lock 880200f48800 Oct 29 16:14:34 ocho kernel: 03: pty_close did lock 880200f48800 Oct 29 16:14:58 ocho kernel: 34: pty_open to lock Oct 29 16:15:13 ocho kernel: 35: pty_open locked Hmmm... 15 seconds to acquire a lock. On other 34-35 pairs this took no-time, of course. In case you wonder, tty_lock() and tty_unlock() gets and releases the global mutex for all TTY related stuff. Introduced somewhere between my old 2.6.35 and 2.6.36. Could it have something to do with my problem? Or maybe, it has something to do with pty_close() looking like this (the printks above omitted)? static void (struct tty_struct *tty, struct file *filp) { [ ... do the stuff, but no hassle with locks... ] tty_*un*lock(); tty_vhangup(tty-link); tty_lock(); } } Hmmm... Calling tty_unlock() first and then tty_lock(). And then return. Maybe I missed something here, but this doesn't look good to me. I didn't get it wrong. All other calls are lock(), then unlock(). Of course. I've sent a little note to the relevant kernel maintainer about this. In the meanwhile, I'll stick to my older kernel. Had enough with those oopses. Thanks to Guy again for that little tip. Eli -- Web: http://www.billauer.co.il ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
On 28/10/10 13:45, Eli Billauer wrote: Hello, The alert reader may have noticed that I've upgraded to 2.6.36. Which has caused a weird behavior with Fedora 12. I should start with sating, that after a couple of minutes, all weird things described below are over and forgotten, and everything runs as a clockwork. So let's see what happens immediately after bootup. The login screen appears and is responsive. I click my own user, and it's fine, but when I type the first character of my password, the GUI freezes for several seconds, and then becomes normal until I press another key on the keyboard. And so forth. While this is going on, logging in from ssh or a text console is extremely slow. In particular, when trying to ssh to the host, the prompt for password appears immediately (server is alive and kicking), but after entering the password it can take a minute or so until the login actually occurs (looks like someone holding some kind of lock?). Another interesting thing is, that if I log in first, and then get the GUI login stuck, I see the following while the GUI is frozen: (1) CPU usage remains idle. (2) A stupid echo loop in bash runs as fast as it normally does (no I/O issues, and CPU is free to run) (3) No special process jumps high on the top application. Any ideas? Any cool logfiles to look at, except for /var/log/{messages, gdm/*}? If both ssh and the graphical logins are slow, I would suspect PAM. Try ssh public key authentication and see what happens (I think that one bypasses pam). Also, see what /var/log/auth* have to say. Shachar -- Shachar Shemesh Lingnu Open Source Consulting Ltd. http://www.lingnu.com ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
Hi, PAM did indeed cross my mind as a suspect. Unfortunately, I have no /var/log/auth* logfiles. As a matter of fact, I don't have any log messages related to authentication. In my very old Redhat 7.3, I had pam_unix messages in /var/log/messages telling me who has changed user ID. On Fedora 12, nothing is mentioned about becoming root in /var/log/messages, and I can't see any directory or file in /var/log that seems to contain that information. Maybe under the audit directory, but the files there are not really human readable. Any idea of where my authenticate mechanism keeps its logs? And why a kernel upgrade would have any effect? rantIn the past, I clearly recall just upgrading my kernel and all was fine. Where are those days?/rant Thanks, Eli Shachar Shemesh wrote: If both ssh and the graphical logins are slow, I would suspect PAM. Try ssh public key authentication and see what happens (I think that one bypasses pam). Also, see what /var/log/auth* have to say. Shachar -- Web: http://www.billauer.co.il ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
why not strace sshd and find what it's doing? log into the machine, then run a separate sshd under an alternate port , and under 'strace -f -o file -tt -T', then see what strace shows you it is doing, and which operations are taking a long time (-tt gives a timestamp for each syscall invocation, -T gives the time a syscall invocation took to complete). --guy Eli Billauer wrote: Hi, PAM did indeed cross my mind as a suspect. Unfortunately, I have no /var/log/auth* logfiles. As a matter of fact, I don't have any log messages related to authentication. In my very old Redhat 7.3, I had pam_unix messages in /var/log/messages telling me who has changed user ID. On Fedora 12, nothing is mentioned about becoming root in /var/log/messages, and I can't see any directory or file in /var/log that seems to contain that information. Maybe under the audit directory, but the files there are not really human readable. Any idea of where my authenticate mechanism keeps its logs? And why a kernel upgrade would have any effect? rantIn the past, I clearly recall just upgrading my kernel and all was fine. Where are those days?/rant Thanks, Eli Shachar Shemesh wrote: If both ssh and the graphical logins are slow, I would suspect PAM. Try ssh public key authentication and see what happens (I think that one bypasses pam). Also, see what /var/log/auth* have to say. Shachar ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux
Re: [Haifux] Login console freezes: Eli's weekly riddle
guy keren wrote: why not strace sshd and find what it's doing? That turned out to be a brilliant idea. I was more aggressive in my approach than you suggested, since I need to know what happens before I can log in, so I injected an strace into the service startup script (with a nomination to the world's ugliest hack). Then I started the service and opened a session while the system behaved normally, and rebooted to try it out during those few minutes of unresponsiveness. I went for the -ff option, which means that I got a file for each fork (and there are oh so many). With a little Perl script to pick up the most time consuming commands, I reached the critical snippet, which is actually the beginning of one of the processes. What we can see, is that trying to open /dev/tty (and fail) took 30 + epsilon seconds, and then opening /dev/pts/0 (successfully) took another 30 + epsilon seconds. Makes me wonder what timeout is set to 30 seconds in the kernel. Needless to say, in the proper strace, these operations took a fraction of a second, like one would expect. Interestingly, it's exactly the same chain of events, only different timing (give or take which pts was opened). Looks like I exchanged one kernel bug for another. Maybe I should report this to LKML...? And still, if this really is a kernel bug, how could it escape the simplest test before the kernel was released? Eli - Trace for freshly booted system follows: 21:35:21.131094 close(12) = 0 0.46 21:35:21.131177 close(13) = 0 0.52 21:35:21.131259 close(4)= 0 0.59 21:35:21.131382 close(8)= 0 0.27 21:35:21.131436 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such device or address) *30.022532* 21:35:51.154269 setsid()= 2333 0.000155 21:35:51.154639 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such device or address) 0.020142 21:35:51.174931 ioctl(9, TIOCSCTTY) = 0 0.30 21:35:51.175049 rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0 0.28 21:35:51.175176 rt_sigaction(SIGHUP, {SIG_IGN, [], SA_RESTORER, 0x7fad91c31740}, NULL, 8) = 0 0.28 21:35:51.175309 vhangup() = -1 EPERM (Operation not permitted) 0.26 21:35:51.175407 rt_sigaction(SIGHUP, NULL, {SIG_IGN, [], SA_RESTORER, 0x7fad91c31740}, 8) = 0 0.27 21:35:51.175524 rt_sigaction(SIGHUP, {SIG_DFL, [], SA_RESTORER, 0x7fad91c31740}, NULL, 8) = 0 0.28 21:35:51.175642 open(/dev/pts/0, O_RDWR) = 4 *30.063213* 21:36:21.238952 close(9)= 0 0.33 21:36:21.239057 open(/dev/tty, O_WRONLY) = 8 0.021415 21:36:21.260555 close(8)= 0 0.34 21:36:21.260660 dup2(4, 0) = 0 0.21 21:36:21.260786 dup2(4, 1) = 1 0.27 21:36:21.260882 dup2(4, 2) = 2 0.26 21:36:21.260973 close(4)= 0 0.23 (and so on...) And here's the parallel trace for a warm and cosy system: 21:28:19.163952 close(12) = 0 0.52 21:28:19.164125 close(13) = 0 0.51 21:28:19.164295 close(4)= 0 0.43 21:28:19.164435 close(8)= 0 0.55 21:28:19.164577 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such device or address) *0.68* 21:28:19.164891 setsid()= 19458 0.22 21:28:19.164990 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such device or address) 0.66 21:28:19.165132 ioctl(9, TIOCSCTTY) = 0 0.24 21:28:19.165230 rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0 0.43 21:28:19.165379 rt_sigaction(SIGHUP, {SIG_IGN, [], SA_RESTORER, 0x7f78bf8fe740}, NULL, 8) = 0 0.34 21:28:19.165502 vhangup() = -1 EPERM (Operation not permitted) 0.33 21:28:19.165604 rt_sigaction(SIGHUP, NULL, {SIG_IGN, [], SA_RESTORER, 0x7f78bf8fe740}, 8) = 0 0.48 21:28:19.165828 rt_sigaction(SIGHUP, {SIG_DFL, [], SA_RESTORER, 0x7f78bf8fe740}, NULL, 8) = 0 0.38 21:28:19.165937 open(/dev/pts/2, O_RDWR) = 4 *0.27* 21:28:19.166011 close(9)= 0 0.17 21:28:19.166065 open(/dev/tty, O_WRONLY) = 8 0.21 21:28:19.166134 close(8)= 0 0.14 21:28:19.166191 dup2(4, 0) = 0 0.14 21:28:19.166246 dup2(4, 1) = 1 0.14 21:28:19.166301 dup2(4, 2) = 2 0.14 21:28:19.166355 close(4)= 0 0.14 -- Web: http://www.billauer.co.il ___ Haifux mailing list Haifux@haifux.org http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux