Re: kernel go-slow

2003-02-06 Thread Alexander Lyamin
Mon, Feb 03, 2003 at 12:27:40AM +0100, Russell Coker wrote:
 I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.
 
 One problem that has started occurring is that periodically some of the
 machines will go really slow for a while.  It's as if the CPU speed has just
 dropped to 1% of its regular speed.  Then after 10 minutes or so it will
 continue as normal.

When it slows down, please check with vmstat for I/O, or with your
LED for disk activity. That is the simple and stupid way.

But there is no really good way to understand what is going on in the kernel
if you are in userland yourself. So go into the kernel with profiling and see
where it spends its precious time. That is slightly more complicated than the
method above, but much more effective.

 
 Has anyone heard of such things before?
 
 I am asking here first because the ReiserFS patch is the most significant 
 kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.
 
 Interestingly the machines that have the problems are not the most active in 
 the file system (mail store), but the mail spool machines.  The mail spool 
 machines do a good amount of file access (but well below the limits of the 
 hardware) and also use more memory and have large load spikes on occasion 
 (virus and spam scanning).

-- 
Cache remedies via multi-variable logic shorts will leave you crying.(cl)
Lex Lyamin



Re: kernel go-slow

2003-02-06 Thread Alexander Lyamin
Thu, Feb 06, 2003 at 02:26:49PM +0300, Alexander Lyamin wrote:
 Mon, Feb 03, 2003 at 12:27:40AM +0100, Russell Coker wrote:
  I'm running a number of machines with 2.4.20 and the ReiserFS journal patches.
  
  One problem that has started occurring is that periodically some of the
  machines will go really slow for a while.  It's as if the CPU speed has just
  dropped to 1% of its regular speed.  Then after 10 minutes or so it will
  continue as normal.
 
 When it slows down, please check with vmstat for I/O, or with your

I think I wasn't clear enough.
So, first: if you go slow during disk activity, chances are good
that it is caused by the FS or the VM, or by their misunderstandings.

But there are possible situations that will not generate disk activity
yet may still cause your system to go slow. If you see some
unusual I/O numbers while disk activity is moderate to low,
it is most likely the same sweet pair.

But Oleg Drokin pointed out situations where even I/O will not indicate
what's going on :)

So the advice is still the same: if you are having slowdowns, profiling might help
you much better than the witchy methods described above.

 LED for disk activity. That is the simple and stupid way.
 
 But there is no really good way to understand what is going on in the kernel
 if you are in userland yourself. So go into the kernel with profiling and see
 where it spends its precious time. That is slightly more complicated than the
 method above, but much more effective.
 
  
  Has anyone heard of such things before?
  
  I am asking here first because the ReiserFS patch is the most significant 
  kernel patch I've applied on what is otherwise a stock 2.4.20 kernel.
  
  Interestingly the machines that have the problems are not the most active in 
  the file system (mail store), but the mail spool machines.  The mail spool 
  machines do a good amount of file access (but well below the limits of the 
  hardware) and also use more memory and have large load spikes on occasion 
  (virus and spam scanning).
Talking about virus/spam scanning: what do you use, and how is it integrated
into your SMTP MTA?

-- 
Cache remedies via multi-variable logic shorts will leave you crying.(cl)
Lex Lyamin



Re: kernel go-slow

2003-02-06 Thread Russell Coker
On Thu, 6 Feb 2003 17:32, Alexander Lyamin wrote:
   One problem that has started occurring is that periodically some of the
   machines will go really slow for a while.  It's as if the CPU speed has
   just dropped to 1% of its regular speed.  Then after 10 minutes or so
   it will continue as normal.
 
  When it slows down, please check with vmstat for I/O, or with your

 I think I wasn't clear enough.
 So, first: if you go slow during disk activity, chances are good
 that it is caused by the FS or the VM, or by their misunderstandings.

vmstat doesn't work properly.  CPU time is 99% system, which suggests that one 
CPU is spending all its time in kernel space (for both threads of a 
hyper-threaded CPU) or that both CPUs have each got one thread locked in 
kernel space.
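
For reference, the "99% system" figure can be cross-checked without vmstat by sampling /proc/stat twice. The following is only a minimal sketch, assuming the aggregate "cpu" line lists jiffies in the order user, nice, system, idle (extra fields on later kernels are simply included in the total). The function names parse_cpu_line and system_share are made up for this illustration.

```python
def parse_cpu_line(stat_text):
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-CPU lines are
        # labelled "cpu0", "cpu1", ... and are skipped here.
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def system_share(before_text, after_text):
    """Percentage of elapsed jiffies spent in system (kernel) mode."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    # Assumed field order: user, nice, system, idle -> index 2 is system.
    return 100.0 * deltas[2] / total
```

To use it, read /proc/stat into a string, wait a few seconds, read it again, and pass both strings to system_share(). A value near 100 during a slowdown would agree with the observation above.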

It's not disk related; those machines don't have much disk access.  The 
machines with the serious disk activity don't have any problems.

 But there are possible situations that will not generate disk activity
 yet may still cause your system to go slow. If you see some
 unusual I/O numbers while disk activity is moderate to low,
 it is most likely the same sweet pair.

The problem is that sar etc. produce jumbled results.  Profiling the kernel may 
help, but may also hide the error, and it's not something I can easily do.

The servers are locked in a managed server room on the other side of the city 
so seeing the blinken lights is not an option.

I've put the aa1 kernel on half the machines and now I'll wait to see what 
happens.  If the aa1 machines don't have the problem but the others do then 
I'll go all aa1.

   Interestingly the machines that have the problems are not the most
   active in the file system (mail store), but the mail spool machines. 
   The mail spool machines do a good amount of file access (but well below
   the limits of the hardware) and also use more memory and have large
   load spikes on occasion (virus and spam scanning).

 Talking about virus/spam scanning: what do you use, and how is it integrated
 into your SMTP MTA?

RAV.  I'm not sure of the details, I think it runs as a daemon that qmail 
talks to.  I try to avoid the anti-virus stuff.

-- 
http://www.coker.com.au/selinux/   My NSA Security Enhanced Linux packages
http://www.coker.com.au/bonnie++/  Bonnie++ hard drive benchmark
http://www.coker.com.au/postal/    Postal SMTP/POP benchmark
http://www.coker.com.au/~russell/  My home page




Re: kernel go-slow

2003-02-06 Thread Oleg Drokin
Hello!

On Thu, Feb 06, 2003 at 05:41:46PM +0100, Russell Coker wrote:

  But there are possible situations that will not generate disk activity
  yet may still cause your system to go slow. If you see some
  unusual I/O numbers while disk activity is moderate to low,
  it is most likely the same sweet pair.
 The problem is that sar etc. produce jumbled results.  Profiling the kernel may 
 help, but may also hide the error, and it's not something I can easily do.

Well, you can do it very easily.
Reboot with the profile=2 kernel option.
When the 100% sys CPU situation starts, execute readprofile -r
When it is finished, execute readprofile -m /path/to/System.map > somefile
Then sort somefile and you are done: you can now see where most of the time
is spent.
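
The final sorting step can also be done with a few lines of Python. This is only a sketch, assuming the saved readprofile output has one function per line in the form "ticks name load" and ends with a "total" summary row. top_symbols is a made-up helper name, not part of any tool.

```python
def top_symbols(profile_text, limit=10):
    """Return (ticks, symbol) pairs with the most ticks, largest first."""
    rows = []
    for line in profile_text.splitlines():
        fields = line.split()
        # Expect at least a tick count and a symbol name on each line.
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        ticks, symbol = int(fields[0]), fields[1]
        if symbol == "total":
            continue  # the summary row is not a kernel function
        rows.append((ticks, symbol))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]
```

Pass it the contents of the file the readprofile output was saved to. The functions at the top of the returned list are where the kernel spent most of its time between the reset and the dump.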

 The servers are locked in a managed server room on the other side of the city 
 so seeing the blinken lights is not an option.

;)
<humour>webcam</humour>

 I've put the aa1 kernel on half the machines and now I'll wait to see what 
 happens.  If the aa1 machines don't have the problem but the others do then 
 I'll go all aa1.

Ah, if your problem was caused by highmem I/O support not being present, then that might actually help.

Bye,
Oleg



Re: when distros do not support official Marcelo kernels they are not being team players (was Re: reiserfs on redhat advanced server?)

2003-02-06 Thread Bernd Schubert
On Tuesday 04 February 2003 23:22, Chris Mason wrote:
 On Tue, 2003-02-04 at 16:45, Hans Reiser wrote:
  The official kernel is our kernel community's only chance to overcome
  its fragmentation.  We need to support it.

 We do support the official kernel, by actively developing and improving
 it, and by answering questions on public mailing lists.  When things
 work in our kernel (like reiserfs, or andrea's vm etc) we work to get it
 into the vanilla kernel.  We're also constantly updating our work to
 keep it in line with the current vanilla sources.  It takes a while due
 to the volume of patches and testing required, but we're always working
 on it.


Hmm, I know it's off topic, but I want to take my chance and complain about 
the Suse kernel.

We are currently testing one of our servers with non-free software and only 
get a pre-compiled kernel module from this company. Unfortunately they only 
have kernel modules for well-known distros, such as Suse, RedHat, etc., and it 
seems to be very difficult to convince them to compile it for a vanilla 
kernel.
So we are using a Suse kernel on Debian. For some reason we couldn't use the 
binary and had to recompile the kernel from the source Suse provides. Well, 
until it came to the module part everything was fine, but then errors 
occurred and we had to find the config options to disable those modules. 
Since we first tried the config options Suse had set as default, the 
Suse people *MUST* have seen for themselves that it doesn't work this way. 
The other thing I'm really wondering about is why normal home users 
need a kernel that has kdb and other very seldom used patches included. Which 
ordinary home user (and that is who I thought Suse was for) needs kernel 
debugging? IMHO kernel debuggers won't have a problem with patching a kernel 
themselves.

Finally we got it working and could even load the module, but later on we 
figured out that NFS file locking for imports from another server isn't 
working properly (/proc/mounts shows that it is enabled, but e.g. 'man' 
complains that it is not). Well, actually I'm surprised that anything works 
in such a heavily patched kernel. 
Of course, the NFS file locking works when we use a vanilla kernel.

So whom shall I blame that it is not working? Suse, telling them we are only 
using their kernel (and even that not without some force)? And I believe 
people from the kernel mailing list won't feel responsible either.

Just a few reasons why I don't like using distros' kernels!

Best regards,
Bernd

PS: The kernel is 2.4.19-4GB. I've forgotten which of Suse's sub-versions it 
was, but it was rather high (174?).



Re: kernel go-slow

2003-02-06 Thread Hans Reiser
Russell Coker wrote:


On Thu, 6 Feb 2003 17:32, Alexander Lyamin wrote:
 

One problem that has started occurring is that periodically some of the
machines will go really slow for a while.  It's as if the CPU speed has
just dropped to 1% of its regular speed.  Then after 10 minutes or so
it will continue as normal.
   

When it slows down, please check with vmstat for I/O, or with your
 

I think I wasn't clear enough.
So, first: if you go slow during disk activity, chances are good
that it is caused by the FS or the VM, or by their misunderstandings.
   


vmstat doesn't work properly.  CPU time is 99% system, which suggests that one 
CPU is spending all its time in kernel space (for both threads of a 
hyper-threaded CPU) or that both CPUs have each got one thread locked in 
kernel space.

 

I propose that you try reversing the datalogging patch for long enough 
to know whether it is our new code that is buggy.

If it is not our code, and it matters enough to justify the cost, we can 
log in remotely and analyze the kernel for you for an hourly fee.  Probably 
the fee you charge them is good enough for us too. ;-)

--
Hans




Re: link/unlink problem gone?

2003-02-06 Thread Zygo Blaxell
In article [EMAIL PROTECTED],
Oleg Drokin  [EMAIL PROTECTED] wrote:
Hello!

   Sigh, these were false hopes indeed.
   I can reproduce it with 2.4.21-pre4, only it is now harder for some reason.

I've seen times-to-failure ranging from 20 minutes to 20+ hours (!).

Interestingly enough, both extremes occurred back-to-back: I tried
for 20 hours to reproduce the problem, failed, tried again with the
same kernel setup, and 20 minutes later the machine was spewing out
"Permission denied" too quickly to display.

   Chris: My current idea is that it happens during low-memory conditions, so I am
   actively looking around prune_icache and its dcache equivalent. Probably
   you can easily reproduce that if you have no swap and not very much RAM.

   (Ok, I just checked, limited the RAM to 90M and turned off SWAP entirely.
and reproduced the problem fairly quickly)

I have observed the problem on machines ranging in size from 96 to
512MB RAM.  I haven't observed a correlation between swapping activity
and failures, but I haven't been looking for this either.  The machines
that have problems are swapping at some time or another (they
have several hundred MB of swap used).

-- 
Zygo Blaxell (Laptop) [EMAIL PROTECTED]
GPG = D13D 6651 F446 9787 600B AD1E CCF3 6F93 2823 44AD



OT: Swapfile to RAM relation with 2.4.2X+

2003-02-06 Thread Manuel Krause
Hi everyone!

Maybe I should address someone else with this question, but maybe someone 
on this list can answer it quickly from their own experience:

In the early days of 2.4.0+, a swap-to-RAM ratio of 2:1 was 
recommended. Because I had several system changes coming up at that time, I 
did not implement this setting right away 
(no money for notebook RAM or for the required disk space). I had 
256-512MB RAM and always 256MB swap.

Last weekend I set up at least a 1:1 swap-to-RAM ratio 
(512:518 here, now). I now see swap space filling up more, and a 
bit more quickly, but no subjective advantage over my previous 
system. Mmh. Linux is now using more RAM for (sometimes also VMware 
related) disk cache.

I don't know: is that all there is to the swap:RAM ratio?! No real advantage???
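
For anyone who wants to check the ratio on their own machine, here is a small sketch that computes it from /proc/meminfo. It assumes the file contains "MemTotal:" and "SwapTotal:" lines with values in kB. swap_to_ram_ratio is a made-up name for this illustration.

```python
def swap_to_ram_ratio(meminfo_text):
    """Return SwapTotal divided by MemTotal from /proc/meminfo contents."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Only the two totals matter here; both are assumed to be in kB.
        if sep and key in ("MemTotal", "SwapTotal") and fields and fields[0].isdigit():
            values[key] = int(fields[0])
    if values.get("MemTotal", 0) == 0:
        raise ValueError("MemTotal not found in meminfo text")
    return values.get("SwapTotal", 0) / values["MemTotal"]
```

Feeding it the contents of /proc/meminfo returns about 1.0 for a 1:1 setup and about 2.0 for the 2:1 setup that used to be recommended.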

Manuel


-
I want to express my personal feelings here - I am simply speechless, 
just praying - concerning the loss of Columbia: I feel for the 
families and the team, and for the future of NASA / ESA / any further manned 
space mission to come, mourning and hoping.
I'm also hoping for more peace on earth, in the Near and Far East; we should 
strictly take care of that, hopefully through non-violent diplomacy and with no 
aggression between any parties or allies living in freedom and peace...
-



Re: link/unlink problem gone?

2003-02-06 Thread Oleg Drokin
Hello!

On Thu, Feb 06, 2003 at 05:32:10PM -0500, Zygo Blaxell wrote:

Sigh, these were false hopes indeed.
I can reproduce it with 2.4.21-pre4, only it is now harder for some reason.
 I've seen times-to-failure ranging from 20 minutes to 20+ hours (!).

Same here.

Chris: My current idea is that it happens during low-memory conditions, so I am
actively looking around prune_icache and its dcache equivalent. Probably
you can easily reproduce that if you have no swap and not very much RAM.
 
(Ok, I just checked, limited the RAM to 90M and turned off SWAP entirely.
 and reproduced the problem fairly quickly)
 I have observed the problem on machines ranging in size from 96 to
 512MB RAM.  I haven't observed a correlation between swapping activity
 and failures but I haven't been looking for this either.  The machines

I noticed that with newer 2.4.21-pre kernels I first see processes die because
of OOM, and only after that do I see direntries pointing to nowhere.
I reproduced this much more than once, so I believe there is some correlation
between the two.

 that have problems are swapping at some time or another (they
 have several hundred MB of swap used).

And they are just swapping all the time, so it may take a while before
useful code runs and the problem happens, it seems.

So far I have concluded that with swap turned off one can reproduce the problem
more easily than with swap on (especially if swap is large).

Bye,
Oleg