Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-11-07 Thread Shachar Shemesh
On 07/11/10 00:44, Eli Billauer wrote:
 and I can only hope that those who maintain the software controlling 
 my car's airbag are doing so with a deeper understanding of what each 
 software component stands.

Can't speak for the airbag industry. I do and did have clients in the 
aeronautics industry, though, and I can categorically tell you that 
things there are much much much better. A typical semi-critical[1] 
software piece designed to run in a civilian aircarft's cockpit gets 
allocated a minimum of 9 hours of QA for each hour of development. There 
are whole companies dedicated to code reviewing and certifying, and 
there are strict, often draconian, software design and style restrictions.

I'm not sure what I feel about such bugs in the Linux kernel, but it[2] 
does not diminish my safe feeling while flying.

Shachar

1 - Semi-critical - the pilot might, indirectly, rely on the information 
received, but it does not, itself control the aircraft, nor does it even 
provide flight-critical data. Maps are non-critical (uncertified). Maps 
showing the aircraft's position on them are semi-critical, and must be 
certified.
2 - Which is not to say that other things might.

-- 
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-11-06 Thread Shachar Shemesh

On 29/10/10 17:04, Eli Billauer wrote:


/* find a device that is not in use. */
printk(KERN_ALERT  34: pty_open to lock\n);
tty_lock();
printk(KERN_ALERT  35: pty_open locked\n);
snip

Set a global variable right before the tty_lock call, and clear it 
immediately after. Inside tty_lock (and probably tty_unlock too), set up 
many printks conditional on this global variable being set. Print any 
relevant identifier you can find (such as the device ID). This should 
help you find out WHY the device takes so long to lock. and hopefully, 
who the contention is with.


Also, in tty_lock, save to a global variable who is holding the lock, 
and print that variable from the code above.


Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-11-06 Thread Eli Billauer

Hello Shachar.


I suppose great minds think alike. ;) This is more or less what I've 
already done, and with some correspondence on LKML, I've managed to nail 
the exact place where the kernel goes to sleep for 30 seconds taking the 
lock with it. Sorry for not updating you, but my thought was to 
summarize the issue when the patch is out.



So a kernel bug it is indeed. It wasn't the simple thing I thought, 
though. It turns out, that under some conditions, a close() operation on 
a serial port may require waiting until all written data is flushed. In 
these cases, it's normal behavior that returning from the close() call 
takes up to 30 seconds (the timeout). In particular, when it's 
modem-manager trying to probe serial ports that don't really exist.



Now, work was done on the TTYs' locking schema between kernels 2.6.35 
and 2.6.36, and the maintainer overlooked this possibility, and hence 
had no problem with the lock being held at that crucial point when the 
data is being drained. As a result, all system calls related to TTYs and 
PTYs (that is, serial ports and *cough* virtual terminals, and I suppose 
other keyboard input) were frozen waiting for the big TTY mutex every 
time modem-manager chose to close a port. Which it does a few times 
after a boot.



The LKML thread is at http://lkml.org/lkml/2010/11/2/314 (for some 
reason, my postings break the threading every time, even though I 
reply-to-all).



This is not a trivial thing to solve, since it seems like nobody really 
knows what assumptions have been made on the two muteces involved, so 
it's not so clear if one can release them just before the possible long 
sleep. I suppose the guy who manipulated the locks will come up with a 
fix sooner or later. I've reverted to 2.6.35 anyhow.



Ah, and by the way, what started this thing was an effort to stop using 
the big kernel lock. As a matter of fact, from 2.6.36, the big kernel 
lock is no longer used in core kernel code. (Hurray...?)



So that's the way things stand. I can't say I was very encouraged by 
this little trip to kernel-land, and I can only hope that those who 
maintain the software controlling my car's airbag are doing so with a 
deeper understanding of what each software component stands. Don't tell 
me. They probably don't. Only they don't discuss their issues over a 
public mailing list.


  Eli



Shachar Shemesh wrote:


On 29/10/10 17:04, Eli Billauer wrote:


/* find a device that is not in use. */
printk(KERN_ALERT  34: pty_open to lock\n);
tty_lock();
printk(KERN_ALERT  35: pty_open locked\n);
snip

Set a global variable right before the tty_lock call, and clear it 
immediately after. Inside tty_lock (and probably tty_unlock too), set 
up many printks conditional on this global variable being set. Print 
any relevant identifier you can find (such as the device ID). This 
should help you find out WHY the device takes so long to lock. and 
hopefully, who the contention is with.


Also, in tty_lock, save to a global variable who is holding the lock, 
and print that variable from the code above.


Shachar
--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com
  



--
Web: http://www.billauer.co.il

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-11-06 Thread Oron Peled
On Sunday, 7 בNovember 2010 00:44:12 Eli Billauer wrote:
 The LKML thread is at http://lkml.org/lkml/2010/11/2/314 (for some 

Eli, I always suspected you are a programmer disguised as a EE, now
you proved me right ;-)

That was a very interesting read.

-- 
Oron Peled Voice: +972-4-8228492
o...@actcom.co.il  http://users.actcom.co.il/~oron
Ignore Your Rights And They'll Go Away
___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-10-29 Thread Shachar Shemesh
On 28/10/10 17:59, Eli Billauer wrote:
 Hi,


 PAM did indeed cross my mind as a suspect. Unfortunately, I have no 
 /var/log/auth* logfiles. As a matter of fact, I don't have any log 
 messages related to authentication. In my very old Redhat 7.3, I had 
 pam_unix messages in /var/log/messages telling me who has changed user 
 ID. On Fedora 12, nothing is mentioned about becoming root in 
 /var/log/messages, and I can't see any directory or file in /var/log 
 that seems to contain that information. Maybe under the audit 
 directory, but the files there are not really human readable.

Just some random thought - perchance the two are not independent? Maybe 
you have a PAM configuration problems that causes both the lack of logs 
AND the slow login?

Also, did you try public key authentication via ssh to see whether that 
one is faster?



 Any idea of where my authenticate mechanism keeps its logs?

Should probably have been where your authentication mechanism keeps the 
config files that tell it where to generate the logs. Give a man a 
fish, and he'll eat for a day. Teach him how to cook, and he'll die 
starving, dreaming of all the wonderful things he could have cooked, had 
only he had a fish

 And why a kernel upgrade would have any effect?
Like I said. I suspect the configuration was broken even before the upgrade.


 rantIn the past, I clearly recall just upgrading my kernel and all 
 was fine. Where are those days?/rant
I'm sorry, you mistagged the above. We employ a strict four pages 
minimum on rants. Soap box addresses between one and four pages should 
carry the bicker tag. Complaints over two lines long should carry the 
hidepark tag. This two sentences should have carried the whine tag.

whineThere used to be a time when you could expect old timers to read 
the list's FAQ/whine

Shachar

P.s.

You did say Fedora 12, as in RedHat based distribution, as in tons 
and tons of patches over the vanilla kernel, right? What kernel did you 
upgrade to? Was it vanilla?
Sh.

-- 
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-10-29 Thread Eli Billauer
Hello,


Even though it would be fun (??!!) to dig into the authentication 
mechanism, isn't the fact that failing to open /dev/tty takes 30 seconds 
enough to blame the kernel alone? Or the same time to open /dev/pts/0? 
Could there be any explanation for this except for a kernel bug?


And yes, I'm using Fedora 12 with a 2.6.36 vanilla kernel. I'm a wild 
guy. But so was my 2.6.35, which has no particular problems, except for 
what happens when the disk is accessed heavily. So the freeze thing lies 
in what happened in the gap between two vanillas.


And as a side remark, I've had two null pointer oopses when logging out 
the GUI console. The caller seems to be 
selinux_socket_unix_stream_connect(). Oopses. I think I'm going to post 
a kernel patch which paints the screen background blue.


My current standpoint is that I've seen enough to know who to blame, and 
that leaves me with the possibility to post a kernel bug report (unless 
someone tells me I'll make a fool of myself doing so) or just wait until 
a kernel patch emerges. As I logout or boot very rarely, I can live with 
this issue for a month or two. Please correct me if I'm wrong.


Thanks,

   Eli


Shachar Shemesh wrote:

 On 28/10/10 17:59, Eli Billauer wrote:
 Hi,


 PAM did indeed cross my mind as a suspect. Unfortunately, I have no 
 /var/log/auth* logfiles. As a matter of fact, I don't have any log 
 messages related to authentication. In my very old Redhat 7.3, I had 
 pam_unix messages in /var/log/messages telling me who has changed 
 user ID. On Fedora 12, nothing is mentioned about becoming root in 
 /var/log/messages, and I can't see any directory or file in /var/log 
 that seems to contain that information. Maybe under the audit 
 directory, but the files there are not really human readable.

 Just some random thought - perchance the two are not independent? 
 Maybe you have a PAM configuration problems that causes both the lack 
 of logs AND the slow login?

 Also, did you try public key authentication via ssh to see whether 
 that one is faster?



 Any idea of where my authenticate mechanism keeps its logs?

 Should probably have been where your authentication mechanism keeps 
 the config files that tell it where to generate the logs. Give a man 
 a fish, and he'll eat for a day. Teach him how to cook, and he'll die 
 starving, dreaming of all the wonderful things he could have cooked, 
 had only he had a fish

 And why a kernel upgrade would have any effect?
 Like I said. I suspect the configuration was broken even before the 
 upgrade.


 rantIn the past, I clearly recall just upgrading my kernel and all 
 was fine. Where are those days?/rant
 I'm sorry, you mistagged the above. We employ a strict four pages 
 minimum on rants. Soap box addresses between one and four pages should 
 carry the bicker tag. Complaints over two lines long should carry 
 the hidepark tag. This two sentences should have carried the whine 
 tag.

 whineThere used to be a time when you could expect old timers to 
 read the list's FAQ/whine

 Shachar

 P.s.

 You did say Fedora 12, as in RedHat based distribution, as in 
 tons and tons of patches over the vanilla kernel, right? What kernel 
 did you upgrade to? Was it vanilla?
 Sh.



-- 
Web: http://www.billauer.co.il

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-10-29 Thread Eli Billauer

Maxim Kovgan wrote:


1) I hope I didn't miss a sentence in one of the emails I've searched the google 
for 48 hours for similar problem...
  
I always check Google first. Just to find out that it's adapting itself 
to common people. For example, looking for pts, I get hits on the 
word points. Looks like we're loosing Google too.


As for the arguments about TTYs blocking, I don't think that is relevant 
during the open() call. Opening a TTY/PTY is doing nothing except 
allocating memory and registering stuff, as far as I know. No reason it 
should take any time. Unless...


Since there's a little kernel hacker hiding in me, I decided to put 
printk's all over the pty.c file. I ended up with 37 of them, and here's 
the interesting part (in the modified code):


snip
static int ptmx_open(struct inode *inode, struct file *filp)
{
   struct tty_struct *tty;
   int retval;
   int index;

   nonseekable_open(inode, filp);

   /* find a device that is not in use. */
   printk(KERN_ALERT  34: pty_open to lock\n);
   tty_lock();
   printk(KERN_ALERT  35: pty_open locked\n);
snip

OK, so I wrote pty_code where it should be ptmx_open. But the 
numbers are the real identifiers.


And here's a small snippet from my /var/log/messages:

Oct 29 16:14:34 ocho kernel: 02: pty_close to lock 880200f48800
Oct 29 16:14:34 ocho kernel: 03: pty_close did lock 880200f48800
Oct 29 16:14:58 ocho kernel: 34: pty_open to lock
Oct 29 16:15:13 ocho kernel: 35: pty_open locked

Hmmm... 15 seconds to acquire a lock. On other 34-35 pairs this took 
no-time, of course. In case you wonder, tty_lock() and tty_unlock() gets 
and releases the global mutex for all TTY related stuff. Introduced 
somewhere between my old 2.6.35 and 2.6.36. Could it have something to 
do with my problem?


Or maybe, it has something to do with pty_close() looking like this (the 
printks above omitted)?


static void (struct tty_struct *tty, struct file *filp)
{
   [ ... do the stuff, but no hassle with locks... ]
   tty_*un*lock();
   tty_vhangup(tty-link);
   tty_lock();
   }
}

Hmmm... Calling tty_unlock() first and then tty_lock(). And then return. 
Maybe I missed something here, but this doesn't look good to me. I 
didn't get it wrong. All other calls are lock(), then unlock(). Of course.


I've sent a little note to the relevant kernel maintainer about this. In 
the meanwhile, I'll stick to my older kernel. Had enough with those oopses.


Thanks to Guy again for that little tip.

  Eli

--
Web: http://www.billauer.co.il

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-10-28 Thread Shachar Shemesh
On 28/10/10 13:45, Eli Billauer wrote:
 Hello,


 The alert reader may have noticed that I've upgraded to 2.6.36. Which
 has caused a weird behavior with Fedora 12.


 I should start with sating, that after a couple of minutes, all weird
 things described below are over and forgotten, and everything runs as a
 clockwork. So let's see what happens immediately after bootup.


 The login screen appears and is responsive. I click my own user, and
 it's fine, but when I type the first character of my password, the GUI
 freezes for several seconds, and then becomes normal until I press
 another key on the keyboard. And so forth.


 While this is going on, logging in from ssh or a text console is
 extremely slow. In particular, when trying to ssh to the host, the
 prompt for password appears immediately (server is alive and kicking),
 but after entering the password it can take a minute or so until the
 login actually occurs (looks like someone holding some kind of lock?).


 Another interesting thing is, that if I log in first, and then get the
 GUI login stuck, I see the following while the GUI is frozen:

 (1) CPU usage remains idle.

 (2) A stupid echo loop in bash runs as fast as it normally does (no I/O
 issues, and CPU is free to run)

 (3) No special process jumps high on the top application.


 Any ideas? Any cool logfiles to look at, except for /var/log/{messages,
 gdm/*}?


If both ssh and the graphical logins are slow, I would suspect PAM. Try 
ssh public key authentication and see what happens (I think that one 
bypasses pam). Also, see what /var/log/auth* have to say.

Shachar

-- 
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-10-28 Thread Eli Billauer
Hi,


PAM did indeed cross my mind as a suspect. Unfortunately, I have no 
/var/log/auth* logfiles. As a matter of fact, I don't have any log 
messages related to authentication. In my very old Redhat 7.3, I had 
pam_unix messages in /var/log/messages telling me who has changed user 
ID. On Fedora 12, nothing is mentioned about becoming root in 
/var/log/messages, and I can't see any directory or file in /var/log 
that seems to contain that information. Maybe under the audit directory, 
but the files there are not really human readable.


Any idea of where my authenticate mechanism keeps its logs?


And why a kernel upgrade would have any effect?


rantIn the past, I clearly recall just upgrading my kernel and all was 
fine. Where are those days?/rant


Thanks,

   Eli


Shachar Shemesh wrote:

  
 If both ssh and the graphical logins are slow, I would suspect PAM. 
 Try ssh public key authentication and see what happens (I think that 
 one bypasses pam). Also, see what /var/log/auth* have to say.

 Shachar



-- 
Web: http://www.billauer.co.il

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-10-28 Thread guy keren

why not strace sshd and find what it's doing?

log into the machine, then run a separate sshd under an alternate port , 
and under 'strace -f -o file -tt -T', then see what strace shows 
you it is doing, and which operations are taking a long time (-tt gives 
a timestamp for each syscall invocation, -T gives the time a syscall 
invocation took to complete).

--guy


Eli Billauer wrote:
 Hi,
 
 
 PAM did indeed cross my mind as a suspect. Unfortunately, I have no 
 /var/log/auth* logfiles. As a matter of fact, I don't have any log 
 messages related to authentication. In my very old Redhat 7.3, I had 
 pam_unix messages in /var/log/messages telling me who has changed user 
 ID. On Fedora 12, nothing is mentioned about becoming root in 
 /var/log/messages, and I can't see any directory or file in /var/log 
 that seems to contain that information. Maybe under the audit directory, 
 but the files there are not really human readable.
 
 
 Any idea of where my authenticate mechanism keeps its logs?
 
 
 And why a kernel upgrade would have any effect?
 
 
 rantIn the past, I clearly recall just upgrading my kernel and all was 
 fine. Where are those days?/rant
 
 
 Thanks,
 
Eli
 
 
 Shachar Shemesh wrote:
 
  
 If both ssh and the graphical logins are slow, I would suspect PAM. 
 Try ssh public key authentication and see what happens (I think that 
 one bypasses pam). Also, see what /var/log/auth* have to say.

 Shachar

 
 

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux


Re: [Haifux] Login console freezes: Eli's weekly riddle

2010-10-28 Thread Eli Billauer

guy keren wrote:


why not strace sshd and find what it's doing?
That turned out to be a brilliant idea. I was more aggressive in my 
approach than you suggested, since I need to know what happens before I 
can log in, so I injected an strace into the service startup script 
(with a nomination to the world's ugliest hack). Then I started the 
service and opened a session while the system behaved normally, and 
rebooted to try it out during those few minutes of unresponsiveness.


I went for the -ff option, which means that I got a file for each fork 
(and there are oh so many). With a little Perl script to pick up the 
most time consuming commands, I reached the critical snippet, which is 
actually the beginning of one of the processes.


What we can see, is that trying to open /dev/tty (and fail)  took 30 + 
epsilon seconds, and then opening /dev/pts/0 (successfully) took another 
30 + epsilon seconds. Makes me wonder what timeout is set to 30 seconds 
in the kernel. Needless to say, in the proper strace, these operations 
took a fraction of a second, like one would expect. Interestingly, it's 
exactly the same chain of events, only different timing (give or take 
which pts was opened).


Looks like I exchanged one kernel bug for another. Maybe I should report 
this to LKML...? And still, if this really is a kernel bug, how could it 
escape the simplest test before the kernel was released?


  Eli

-
Trace for freshly booted system follows:

21:35:21.131094 close(12)   = 0 0.46
21:35:21.131177 close(13)   = 0 0.52
21:35:21.131259 close(4)= 0 0.59
21:35:21.131382 close(8)= 0 0.27
21:35:21.131436 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
device or address) *30.022532*

21:35:51.154269 setsid()= 2333 0.000155
21:35:51.154639 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
device or address) 0.020142

21:35:51.174931 ioctl(9, TIOCSCTTY) = 0 0.30
21:35:51.175049 rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0 
0.28
21:35:51.175176 rt_sigaction(SIGHUP, {SIG_IGN, [], SA_RESTORER, 
0x7fad91c31740}, NULL, 8) = 0 0.28
21:35:51.175309 vhangup()   = -1 EPERM (Operation not 
permitted) 0.26
21:35:51.175407 rt_sigaction(SIGHUP, NULL, {SIG_IGN, [], SA_RESTORER, 
0x7fad91c31740}, 8) = 0 0.27
21:35:51.175524 rt_sigaction(SIGHUP, {SIG_DFL, [], SA_RESTORER, 
0x7fad91c31740}, NULL, 8) = 0 0.28

21:35:51.175642 open(/dev/pts/0, O_RDWR) = 4 *30.063213*
21:36:21.238952 close(9)= 0 0.33
21:36:21.239057 open(/dev/tty, O_WRONLY) = 8 0.021415
21:36:21.260555 close(8)= 0 0.34
21:36:21.260660 dup2(4, 0)  = 0 0.21
21:36:21.260786 dup2(4, 1)  = 1 0.27
21:36:21.260882 dup2(4, 2)  = 2 0.26
21:36:21.260973 close(4)= 0 0.23

(and so on...)

And here's the parallel trace for a warm and cosy system:

21:28:19.163952 close(12)   = 0 0.52
21:28:19.164125 close(13)   = 0 0.51
21:28:19.164295 close(4)= 0 0.43
21:28:19.164435 close(8)= 0 0.55
21:28:19.164577 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
device or address) *0.68*

21:28:19.164891 setsid()= 19458 0.22
21:28:19.164990 open(/dev/tty, O_RDWR|O_NOCTTY) = -1 ENXIO (No such 
device or address) 0.66

21:28:19.165132 ioctl(9, TIOCSCTTY) = 0 0.24
21:28:19.165230 rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0 
0.43
21:28:19.165379 rt_sigaction(SIGHUP, {SIG_IGN, [], SA_RESTORER, 
0x7f78bf8fe740}, NULL, 8) = 0 0.34
21:28:19.165502 vhangup()   = -1 EPERM (Operation not 
permitted) 0.33
21:28:19.165604 rt_sigaction(SIGHUP, NULL, {SIG_IGN, [], SA_RESTORER, 
0x7f78bf8fe740}, 8) = 0 0.48
21:28:19.165828 rt_sigaction(SIGHUP, {SIG_DFL, [], SA_RESTORER, 
0x7f78bf8fe740}, NULL, 8) = 0 0.38

21:28:19.165937 open(/dev/pts/2, O_RDWR) = 4 *0.27*
21:28:19.166011 close(9)= 0 0.17
21:28:19.166065 open(/dev/tty, O_WRONLY) = 8 0.21
21:28:19.166134 close(8)= 0 0.14
21:28:19.166191 dup2(4, 0)  = 0 0.14
21:28:19.166246 dup2(4, 1)  = 1 0.14
21:28:19.166301 dup2(4, 2)  = 2 0.14
21:28:19.166355 close(4)= 0 0.14

--
Web: http://www.billauer.co.il

___
Haifux mailing list
Haifux@haifux.org
http://hamakor.org.il/cgi-bin/mailman/listinfo/haifux