Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Jakob Oestergaard
On Wed, Jan 09, 2008 at 02:32:53PM +0300, Anton Salikhmetov wrote:
...
 
 This bug causes backup systems to *miss* changed files.
 

This problem is seen with both Amanda and TSM (Tivoli Storage Manager).

A site running Amanda with, say, a full backup weekly and incremental backups
daily, will only get weekly backups of their mmap-modified databases.

However, large sites running TSM will be hit even harder by this because TSM
will always perform incremental backups from the client (managing which
versions to keep for how long on the server side). TSM will *never* again take
a backup of the mmap-modified database.

The really nasty part is: nothing is failing. Everything *appears* to work.
Your data is just not backed up because it appears to be untouched.

So, if you run TSM (or pretty much any other backup solution actually) on
Linux, maybe you should run a
 find / -type f -print0 | xargs -0 touch
before starting your backup job. Sort of removes the point of using proper
backup software, but at least you'll get your data backed up.


-- 

 / jakob

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Jakob Oestergaard
On Wed, Jan 09, 2008 at 02:32:53PM +0300, Anton Salikhmetov wrote:
 Since no reaction in LKML was received for this message it seemed
 logical to suggest closing the bug #2645 as WONTFIX:
 
 http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15

Thank you!

A quick run-down for those who don't know what this is about:

Some applications use mmap() to modify files. Common examples are databases.

Linux does not update the mtime of files that are modified using mmap, even if
msync() is called.

This is very clearly against OpenGroup specifications.

This misfeature causes such files to be silently *excluded* from normal backup
runs.

Solaris implements this properly.

NT has the same bug as Linux, using their private bastardisation of the mmap
interface - but since they don't care about SuS and are broken in so many other
ways, that really doesn't matter.


So, dear kernel developers, can we please integrate this patch to make Linux
stop silently excluding people's databases from their backups?

-- 

 / jakob



Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Jakob Oestergaard
On Wed, Jan 09, 2008 at 05:06:33PM -0500, Rik van Riel wrote:
...
  
  Lather, rinse, repeat

Just verified this at one customer site; they had a db that was last backed up
in 2003 :/

 
 On the other hand, updating the mtime and ctime whenever a page is dirtied
 also does not work right.  Apparently that can break mutt.
 

Thinking back on the atime discussion, I bet there would be some performance
problems in updating the ctime/mtime that often too :)

 Calling msync() every once in a while with Anton's patch does not look like a
 foolproof method to me either, because the VM can write all the dirty pages
 to disk by itself, leaving nothing for msync() to detect.  (I think...)

Reading the man page:
The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED and
PROT_WRITE will be marked for update at some point in the interval between a
write reference to the mapped region and the next call to  msync() with
MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is no
such call, these fields may be marked for update at any time after a write
reference if the underlying file is modified as a result.

So, whenever someone writes in the region, this must cause us to update the
mtime/ctime no later than at the time of the next call to msync().

Could one do something like the lazy atime updates, coupled with a forced flush
at msync()?

 Can we get by with simply updating the ctime and mtime every time msync()
 is called, regardless of whether or not the mmaped pages were still dirty
 by the time we called msync() ?

The update must still happen, eventually, after a write to the mapped region
followed by an unmap/close even if no msync is ever called.

The msync only serves as a "no later than" deadline. The write to the region
triggers the need for the update.

At least this is how I read the standard - please feel free to correct me if I
am mistaken.

-- 

 / jakob



Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-10 Thread Jakob Oestergaard
On Thu, Jan 10, 2008 at 03:03:03AM +0300, Anton Salikhmetov wrote:
...
  I guess a third possible time (if we want to minimize the number of
  updates) would be when natural syncing of the file data to disk, by
  other things in the VM, would be about to clear the I_DIRTY_PAGES
   flag on the inode.  That way we do not need to remember any special
   "we already flushed all dirty data, but we have not updated the mtime
   and ctime yet" state.
 
  Does this sound reasonable?
 
 No, it doesn't. The msync() system call called with the MS_ASYNC flag
 should (the POSIX standard requires that) update the st_ctime and
 st_mtime stamps in the same manner as for the MS_SYNC flag. However,
 the current implementation of msync() doesn't call the do_fsync()
 function for the MS_ASYNC case. The msync() function may be called
 with the MS_ASYNC flag before natural syncing.

If the update was done as Rik suggested, with the addition that msync()
triggered an explicit sync of the inode data, then everything would be ok,
right?

-- 

 / jakob



Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser(2)

2007-11-10 Thread Jakob Oestergaard
...
 I've double-checked the code for any possible off-by-one/overflow 
 errors.
...

Two things caught my eye.

...
 + case bol:
 + case subject:
 + if (*label_len >= SMK_MAXLEN)
 + goto out;
 + subjectstr[(*label_len)++] = data[i];

Why is the '>' necessary?  Could it happen that you had incremented past the
point of equality?

If that could not happen, then in my opinion '>=' is very misleading when '=='
is really what is needed.

...
 + case object:
 + if (*prevstate == blank) {
 + subjectstr[*label_len] = '\0';
 + *label_len = 0;
 + }

I wonder why it is valid to uncritically use the already incremented label_len
here, without checking its value (like is done above).

It seems strangely asymmetrical. I'm not saying it's wrong, because there may
be a subtle reason as to why it's not, but if that's the case then I think that
subtle reason should be documented with a comment.

...
 + case access:
 + if (*prevstate == blank) {
 + objectstr[*label_len] = '\0';
 + *label_len = 0;
 + }

Same applies here.


-- 

 / jakob



Re: recent nfs change causes autofs regression

2007-09-03 Thread Jakob Oestergaard
On Fri, Aug 31, 2007 at 09:43:29AM -0700, Linus Torvalds wrote:
...
 This is *not* a security hole. In order to make it a security hole, you 
 need to be root in the first place.

Non-root users can write to places where root might believe they cannot write
because he might be under the mistaken assumption that ro means ro.

I am under the impression that that could have implications in some setups.

...
 
  - it's a misfeature that people are used to, and has been around forever.

Sure, they're used to it, but I doubt they are aware of it.

...
 so I really don't see why people excuse the new behaviour.

We can certainly agree that a nicer fix would be nicer :)

-- 

 / jakob



Re: recent nfs change causes autofs regression

2007-08-31 Thread Jakob Oestergaard
On Thu, Aug 30, 2007 at 10:16:37PM -0700, Linus Torvalds wrote:
 
...
  Why aren't we doing that for any other filesystem than NFS?
 
 How hard is it to acknowledge the following little word:
 
   regression
 
 It's simple. You broke things. You may want to fix them, but you need to 
 fix them in a way that does not break user space.

Trond has a point Linus.

What he broke is, for example, a ro mount being mounted as rw.

That *could* be a very serious security (etc.etc.) problem which he just fixed.
Anything depending on read-only not being enforced will cease to work, of
course, and that is what a few people complain about(!).

If ext3 in some rare case (which would still mean it hit a few thousand users)
failed to remember that a file had been marked read-only and allowed writes to
it, wouldn't we want to fix that too?  It would cause regressions, but we'd fix
it, right?

mount passes back the error code on a failed mount. autofs passes that error
along too (when people configure syslog correctly). In short; when these
serious mistakes are made and caught, the admin sees an error in his logs.

This is not wrong. This is good.

-- 

 / jakob



Re: recent nfs change causes autofs regression

2007-08-31 Thread Jakob Oestergaard
On Fri, Aug 31, 2007 at 01:07:56AM -0700, Linus Torvalds wrote:
...
 When we add NEW BEHAVIOUR, we don't add it to old interfaces when that 
 breaks old user mode! We add a new flag saying "I want the new behaviour".
 
 This is not rocket science, guys. This is very basic kernel behaviour. The 
 kernel exists only to serve user space, and that means that there is no 
 more important thing to do than to make sure you don't break existing 
 users, unless you have some *damn* strong reasons.

100% agreed.

 The fact that he may *also* have broken insane setups is totally 
 irrelevant. Don't go off on some tangent that has nothing to do with the 
 regression in question!

It does not have nothing to do with the regression.

Some setups which worked more by accident than by design earlier on were broken
by the fix. This could have been avoided, I agree, but the breakage was caused
by the fix (or the breakage is the fix, however you prefer to look at it).

  If ext3 in some rare case (which would still mean it hit a few thousand 
  users)
  failed to remember that a file had been marked read-only and allowed writes 
  to
  it, wouldn't we want to fix that too?  It would cause regressions, but we'd 
  fix
  it, right?
 
 Stop blathering. Of course we fix security holes. But we don't break 
 things that don't need breaking. This wasn't a security hole.

*part* of it wasn't a security hole.

The other half very much was.

...
 In other words, it should (as I already mentioned once) have used 
 nosharecache by default, which makes it all work.
 
 Then, people who want to re-use the caches (which in turn may mean that 
 everything needs to have the same flags), THOSE PEOPLE, who want the NEW 
 SEMANTICS (errors and all) should then use a sharecache flag.
 
 See? You don't have to screw people over.

Sure, given that Trond (or whomever) has the time it takes to go and implement
all of this, there's no need to screw anyone.

Assuming he's on a schedule and this will have to wait, I agree with him that
it makes the most sense to play it safe security/consistency-wise rather than
functionality-wise.

  mount passes back the error code on a failed mount. autofs passes that error
  along too (when people configure syslog correctly). In short; when these
  serious mistakes are made and caught, the admin sees an error in his logs.
 
 Bullshit. Seeing the error in his logs doesn't help anything.

It makes troubleshooting possible, which addresses *the* major complaint from
*one* of the *two* people who complained about this.


-- 

 / jakob



Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sun, Aug 05, 2007 at 09:28:05AM +0200, Ingo Molnar wrote:
 
 * Alan Cox [EMAIL PROTECTED] wrote:
 
Can you give examples of backup solutions that rely on atime being 
   updated? I can understand backup tools using mtime/ctime for 
   incremental backups (like tar + Amanda, etc), but I'm having trouble 
   figuring out why someone would want to use atime for that.
  
  HSM is the usual one, and to a large extent probably why Unix 
  originally had atime. Basically migrating less used files away so as 
  to keep the system disks tidy.
 
 atime is used as a _hint_, at most, and HSM sure works just fine on an
 atime-incapable filesystem too. So it's the same deal as "add the user_xattr
 mount option to the filesystem to make Beagle index faster". It's now:
 "if you use HSM storage, add the atime mount option to make it slightly
 more intelligent. Expect huge IO slowdowns though."
 
 The only remotely valid compatibility argument would be Mutt - but even 
 that handles it just fine. (we broke way more software via noexec)

I find it pretty normal to use tmpreaper to clear out unused files from
certain types of semi-temporary directory structures. Those files are
often only ever read. They'd start randomly disappearing while in use.

But then again, maybe I'm the only guy on the planet who uses tmpreaper.

-- 

 / jakob



Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sun, Aug 05, 2007 at 06:42:30AM -0400, Jeff Garzik wrote:
...
 If you can show massive amounts of users that will actually be 
 negatively impacted, please present hard evidence.
 
 Otherwise all this is useless hot air.

Peace Jeff  :)

In another mail, I gave an example with tmpreaper clearing out unused
files; if some of those files are only read and never modified,
tmpreaper would start deleting files which were still frequently used.

That's a regression, the way I see it. As for 'massive amounts of
users', well, tmpreaper exists in most distros, so it's possible it has
other users than just me.

-- 

 / jakob



Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sat, Aug 04, 2007 at 02:08:40PM -0400, Jeff Garzik wrote:
 Linus Torvalds wrote:
 The relatime thing that David mentioned might well be very useful, but 
 it's probably even less used than noatime is. And sadly, I don't really 
 see that changing (unless we were to actually change the defaults inside 
 the kernel).
 
 
 I actually vote for that.  IMO, distros should turn -on- atime updates 
 when they know it's needed.

Oh dear.

Why not just make ext3 fsync() a no-op while you're at it?

Distros can turn it back on if it's needed...

Of course I'm not serious, but like atime, fsync() is something one
expects to work if it's there.  Disabling atime updates or making
fsync() a no-op will both result in silent failure which I am sure we
can agree is disastrous.

Why on earth would you cripple the kernel defaults for ext3 (which is a
fine FS for boot/root filesystems), when the *fundamental* problem you
really want to solve lie much deeper in the implementation of the
filesystem?  Noatime doesn't solve the problem, it just makes it less
horrible.

If you really need different filesystem performance characteristics, you
can switch to another filesystem. There's plenty to choose from.

-- 

 / jakob



Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sun, Aug 05, 2007 at 02:46:48PM +0200, Ingo Molnar wrote:
 
 * Jakob Oestergaard [EMAIL PROTECTED] wrote:
 
   If you can show massive amounts of users that will actually be 
   negatively impacted, please present hard evidence.
   
   Otherwise all this is useless hot air.
  
  Peace Jeff :)
  
  In another mail, I gave an example with tmpreaper clearing out unused 
  files; if some of those files are only read and never modified, 
  tmpreaper would start deleting files which were still frequently used.
  
  That's a regression, the way I see it. As for 'massive amounts of 
  users', well, tmpreaper exists in most distros, so it's possible it 
  has other users than just me.
 
 you mean tmpwatch?

Same same.

 The trivial change below fixes this. And with that 
 we've come to the end of an extremely short list of atime dependencies.

Please read what I wrote, not what you think I wrote.

If I only *read* those files, the mtime will not be updated, only the
atime.

And the files *will* then magically begin to disappear although they are
frequently used.

That will happen with a standard piece of software in a standard
configuration, in a scenario that may or may not be common... I have no
idea how common such a setup is - but I know how much it would suck to
have files magically disappearing because of a kernel upgrade  :)

-- 

 / jakob



bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-06 Thread Jakob Oestergaard

Hello list,

Setup:
 NFS server (dual opteron, HW RAID, SCA disk enclosure) on 2.6.11.6
 NFS client (dual PIII) on 2.6.11.6

Both on switched gigabit ethernet - I use NFSv3 over UDP (tried TCP but
this makes no difference).

Problem; during simple tests such as a 'cp largefile0 largefile1' on the
client (under the mountpoint from the NFS server), the client becomes
extremely laggy, NFS writes are slow, and I see very high CPU
utilization by bdflush and rpciod.

For example, writing a single 8G file with dd will give me about
20MB/sec (I get 60+ MB/sec locally on the server), and the client rarely
drops below 40% system CPU utilization.

I tried profiling the client (booting with profile=2), but the profile
traces do not make sense; a profile from a single write test where the
client did not at any time drop below 30% system time (and frequently
were at 40-50%) gives me something like:

raven:~# less profile3 | sort -nr | head
257922 total  2.6394
254739 default_idle 5789.5227
   960 smp_call_function  4.
   888 __might_sleep  5.6923
   569 finish_task_switch 4.7417
   176 kmap_atomic1.7600
   113 __wake_up  1.8833
74 kmap   1.5417
64 kunmap_atomic  5.

The difference between default_idle and total is 1.2% - but I never saw
system CPU utilization under 30%...

Besides, there's basically nothing in the profile that rhymes with
rpciod or bdflush (the two high-hitters on top during the test).

What do I do?

Performance sucks and the profiles do not make sense...

Any suggestions would be greatly appreciated,

Thank you!

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-07 Thread Jakob Oestergaard
On Wed, Apr 06, 2005 at 05:28:56PM -0400, Trond Myklebust wrote:
...
 A look at nfsstat might help, as might netstat -s.
 
 In particular, I suggest looking at the retrans counter in nfsstat.

When doing a 'cp largefile1 largefile2' on the client, I see approx. 10
retransmissions per second in nfsstat.

I don't really know if this is a lot...

I also see packets dropped in ifconfig - approx. 10 per second...  I
wonder if these two are related.

Client has an intel e1000 card - I just set the RX ring buffer to the
max. of 4096 (up from the default of 256), but this doesn't seem to help
a lot (I see the 10 drops/sec with the large RX buffer).

I use NAPI - is there anything else I can do to make the card not drop
packets?   I'm just assuming that this might at least be a part of the
problem, but with large RX ring and NAPI I don't know how much else I
can do to not make the box drop incoming data...

 When you say that TCP did not help, please note that if retrans is high,
 then using TCP with a large value for timeo (for instance -otimeo=600)
 is a good idea. It is IMHO a bug for the mount program to be setting
 default timeout values of less than 30 seconds when using TCP.

I can try that.

Thanks!

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-07 Thread Jakob Oestergaard
On Thu, Apr 07, 2005 at 09:19:06AM +1000, Greg Banks wrote:
...
 How large is the client's RAM? 

2GB - (32 bit kernel because it's dual PIII, so I use highmem)

A few more details:

With standard VM settings, the client will be laggy during the copy, but
it will also have a load average around 10 (!)   And really, the only
thing I do with it is one single 'cp' operation.  The CPU hogs are
pdflush, rpciod/0 and rpciod/1.

I tweaked the VM a bit, put the following in /etc/sysctl.conf:
 vm.dirty_writeback_centisecs=100
 vm.dirty_expire_centisecs=200

The defaults are 500 and 3000 respectively...

This improved things a lot; the client is now almost not very laggy,
and load stays in the saner 1-2 range.

Still, system CPU utilization is very high (still from rpciod and
pdflush - more rpciod and less pdflush though), and the file copying
performance over NFS is roughly half of what I get locally on the server
(8G file copy with 16MB/sec over NFS versus 32 MB/sec locally).

(I run with plenty of knfsd threads on the server, and generally the
server is not very loaded when the client is pounding it as much as it
can)

 What does the following command report
 before and during the write?
 
 egrep 'nfs_page|nfs_write_data' /proc/slabinfo

During the copy I typically see:

nfs_write_data  681   952 480  8 1 : tunables  54 27 8 : slabdata 119 119 108
nfs_page  15639 18300  64 61 1 : tunables 120 60 8 : slabdata 300 300 180

The 18300 above typically goes from 12000 to 25000...

After the copy I see:

nfs_write_data  36  48 480  8 1 : tunables   54   27 8 : slabdata  5  6  0
nfs_page 1  61  64 61 1 : tunables  120   60 8 : slabdata  1  1  0

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-09 Thread Jakob Oestergaard
On Thu, Apr 07, 2005 at 12:17:51PM -0400, Trond Myklebust wrote:
 On Thursday 07.04.2005 at 17:38 (+0200), Jakob Oestergaard wrote:
 
  I tweaked the VM a bit, put the following in /etc/sysctl.conf:
   vm.dirty_writeback_centisecs=100
   vm.dirty_expire_centisecs=200
  
  The defaults are 500 and 3000 respectively...
  
  This improved things a lot; the client is now almost not very laggy,
  and load stays in the saner 1-2 range.
 
 OK. That hints at what is causing the latencies on the server: I'll bet
 it is the fact that the page reclaim code tries to be clever, and uses
 NFSv3 STABLE writes in order to be able to free up the dirty pages
 immediately. Could you try the following patch, and see if that makes a
 difference too?

The patch alone without the tweaked VM settings doesn't cure the lag - I
think it's better than without the patch (I can actually type this mail
with a large copy running). With the tweaked VM settings too, it's
pretty good - still a little lag, but not enough to really make it
annoying.

Performance is pretty much the same as before (copying an 8GiB file with
15-16MiB/sec - about half the performance of what I get locally on the
file server).

Something that worries me;  It seems that 2.4.25 is a lot faster as NFS
client than 2.6.11.6, most notably on writes - see the following
tiobench results (2000 MiB file, tests with 1, 2 and 4 threads) up
against the same NFS server:

2.4.25:  (dual athlon MP 1.4GHz, 1G RAM, Intel e1000)

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  58.87 54.9% 5.615 5.03% 44.40 44.2% 4.534 8.41%
   . 2000   40962  56.98 58.3% 6.926 6.64% 41.61 58.0% 4.462 10.8%
   . 2000   40964  53.90 59.0% 7.764 9.44% 39.85 61.5% 4.256 10.8%


2.6.11.6: (dual PIII 1GHz, 2G RAM, Intel e1000)

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  38.34 18.8% 19.61 6.77% 22.53 23.4% 6.947 15.6%
   . 2000   40962  52.82 29.0% 24.42 9.37% 24.20 27.0% 7.755 16.7%
   . 2000   40964  62.48 34.8% 33.65 17.0% 24.73 27.6% 8.027 15.4%


44MiB/sec for 2.4 versus 22MiB/sec for 2.6 - any suggestions as to how
this could be improved?

(note; the write performance doesn't change notably with VM tuning nor
with the one-liner change that Trond suggested)

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Sat, Apr 09, 2005 at 05:52:32PM -0400, Trond Myklebust wrote:
 On Saturday 09.04.2005 at 23:35 (+0200), Jakob Oestergaard wrote:
 
  2.6.11.6: (dual PIII 1GHz, 2G RAM, Intel e1000)
  
   File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
  --- -- --- --- --- --- --- ---
 . 2000   40961  38.34 18.8% 19.61 6.77% 22.53 23.4% 6.947 15.6%
 . 2000   40962  52.82 29.0% 24.42 9.37% 24.20 27.0% 7.755 16.7%
 . 2000   40964  62.48 34.8% 33.65 17.0% 24.73 27.6% 8.027 15.4%
  
  
  44MiB/sec for 2.4 versus 22MiB/sec for 2.6 - any suggestions as to how
  this could be improved?
 
 What happened to the retransmission rates when you changed to TCP?

tcp with timeo=600 causes retransmits (as seen with nfsstat) to drop to
zero.

 
 Note that on TCP (besides bumping the value for timeo) I would also
 recommend using a full 32k r/wsize instead of 4k (if the network is of
 decent quality, I'd recommend 32k for UDP too).

32k seems to be default for both UDP and TCP.

The network should be of decent quality - e1000 on client, tg3 on
server, both with short cables into a gigabit switch with plenty of
backplane headroom.

 The other tweak you can apply for TCP is to bump the value
 for /proc/sys/sunrpc/tcp_slot_table_entries. That will allow you to have
 several more RPC requests in flight (although that will also tie up more
 threads on the server).

Changing only to TCP gives me:

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  47.04 65.2% 50.57 26.2% 24.24 29.7% 6.896 28.7%
   . 2000   40962  55.77 66.1% 61.72 31.9% 24.13 33.0% 7.646 26.6%
   . 2000   40964  61.94 68.9% 70.52 42.5% 25.65 35.6% 8.042 26.7%

Pretty much the same as before - with writes being suspiciously slow
(compared to good ole' 2.4.25)

With tcp_slot_table_entries bumped to 64 on the client (128 knfsd
threads on the server, same as in all tests), I see:

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  60.50 67.6% 30.12 14.4% 22.54 30.1% 7.075 27.8%
   . 2000   40962  59.87 69.0% 34.34 19.0% 24.09 35.2% 7.805 30.0%
   . 2000   40964  62.27 69.8% 44.87 29.9% 23.07 34.3% 8.239 30.9%

So, reads start off better, it seems, but writes are still half speed of
2.4.25.

I should say that it is common to see a single rpciod thread hogging
100% CPU for 20-30 seconds - that looks suspicious to me. Other than
that, I can't really point my finger at anything in this setup.

Any suggestions Trond?   I'd be happy to run some tests for you if you
have any idea how we can speed up those writes (or reads for that
matter, although I am fairly happy with those).


-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Mon, Apr 11, 2005 at 08:35:39AM -0400, Trond Myklebust wrote:
...
 That certainly shouldn't be the case (and isn't on any of my setups). Is
 the behaviour identical same on both the PIII and the Opteron systems?

The dual opteron is the nfs server

The dual athlon is the 2.4 nfs client

The dual PIII is the 2.6 nfs client

 As for the WRITE rates, could you send me a short tcpdump from the
 sequential write section of the above test? Just use tcpdump -s 9
 -w binary.dmp  just for a couple of seconds. I'd like to check the
 latencies, and just check that you are indeed sending unstable writes
 with not too many commit or getattr calls.

Certainly;

http://unthought.net/binary.dmp.bz2

I got an 'invalid snaplen' with the 9 you suggested, the above dump
is done with 9000 - if you need another snaplen please just let me know.

A little explanation for the IPs you see;
 sparrow/10.0.1.20 - nfs server
 raven/10.0.1.7 - 2.6 nfs client
 osprey/10.0.1.13 - NIS/DNS server

Thanks,

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Mon, Apr 11, 2005 at 10:35:25AM -0400, Trond Myklebust wrote:
 må den 11.04.2005 Klokka 15:47 (+0200) skreiv Jakob Oestergaard:
 
  Certainly;
  
  http://unthought.net/binary.dmp.bz2
  
  I got an 'invalid snaplen' with the 9 you suggested, the above dump
  is done with 9000 - if you need another snaplen please just let me know.
 
 So, the RPC itself looks good, but it also looks as if after a while you
 are running into some heavy retransmission problems with TCP too (at the
 TCP level now, instead of at the RPC level). When you get into that
 mode, it looks as if every 2nd or 3rd TCP segment being sent from the
 client is being lost...

Odd...

I'm really sorry for using your time if this ends up being just a
networking problem.

 That can mean either that the server is dropping fragments, or that the
 client is dropping the replies. Can you generate a similar tcpdump on
 the server?

Certainly;  http://unthought.net/sparrow.dmp.bz2


-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Mon, Apr 11, 2005 at 11:21:45AM -0400, Trond Myklebust wrote:
 må den 11.04.2005 Klokka 16:41 (+0200) skreiv Jakob Oestergaard:
 
   That can mean either that the server is dropping fragments, or that the
   client is dropping the replies. Can you generate a similar tcpdump on
   the server?
  
  Certainly;  http://unthought.net/sparrow.dmp.bz2
 
 So, it looks to me as if sparrow is indeed dropping packets (missed
 sequences). Is it running with NAPI enabled too?

Yes, as far as I know - the Broadcom Tigon3 driver does not have the
option of enabling/disabling RX polling (if we agree that is what we're
talking about), but looking in tg3.c it seems that it *always*
unconditionally uses NAPI...

sparrow:~# ifconfig
eth0  Link encap:Ethernet  HWaddr 00:09:3D:10:BB:1E
  inet addr:10.0.1.20  Bcast:10.0.1.255  Mask:255.255.255.0
  inet6 addr: fe80::209:3dff:fe10:bb1e/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:2304578 errors:0 dropped:0 overruns:0 frame:0
  TX packets:2330829 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:2381644307 (2.2 GiB)  TX bytes:2191756317 (2.0 GiB)
  Interrupt:169

No dropped packets... I wonder if the tg3 driver is being completely
honest about this...

Still, 2.4 manages to perform twice as fast against the same server.

And, the 2.6 client still has extremely heavy CPU usage (from rpciod
mainly, which doesn't show up in profiles)

The plot thickens...

Trond (or anyone else feeling they might have some insight they would
like to share on this one), I'll do anything you say (ok, *almost*
anything you say) - any ideas?


-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-12 Thread Jakob Oestergaard
On Tue, Apr 12, 2005 at 11:03:29AM +1000, Greg Banks wrote:
 On Tue, 2005-04-12 at 01:42, Jakob Oestergaard wrote:
  Yes, as far as I know - the Broadcom Tigon3 driver does not have the
  option of enabling/disabling RX polling (if we agree that is what we're
  talking about), but looking in tg3.c it seems that it *always*
  unconditionally uses NAPI...
 
 I've whined and moaned about this in the past, but for all its
 faults NAPI on tg3 doesn't lose packets.  It does cause a huge
 increase in irq cpu time on multiple fast CPUs.  What irq rate
 are you seeing?

Around 20,000 interrupts per second during the large write, on the IRQ
where eth0 is (this is not shared with anything else).

[sparrow:joe] $ cat /proc/interrupts
   CPU0   CPU1
...
169:3853488  412570512   IO-APIC-level  eth0
...


But still, guys, it is the *same* server with tg3 that runs well with a
2.4 client but poorly with a 2.6 client.

Maybe I'm just staring myself blind at this, but I can't see how a
general problem on the server (such as packet loss, latency or whatever)
would cause no problems with a 2.4 client but major problems with a 2.6
client.

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-19 Thread Jakob Oestergaard
On Tue, Apr 12, 2005 at 11:28:43AM +0200, Jakob Oestergaard wrote:
...
 
 But still, guys, it is the *same* server with tg3 that runs well with a
 2.4 client but poorly with a 2.6 client.
 
 Maybe I'm just staring myself blind at this, but I can't see how a
 general problem on the server (such as packet loss, latency or whatever)
 would cause no problems with a 2.4 client but major problems with a 2.6
 client.

Another data point;

I upgraded my mom's machine from an earlier 2.6 (don't remember which,
but I can find out) to 2.6.11.6.

It mounts a home directory from a 2.6.6 NFS server - the client and
server are on a hub'ed 100Mbit network.

On the earlier 2.6 client I/O performance was as one would expect on
hub'ed 100Mbit - meaning, not exactly stellar, but you'd get around 4-5
MB/sec and decent interactivity.

The typical workload here is storing or retrieving large TIFF files on
the client, while working with other things in KDE. So, if the
large-file NFS I/O causes NFS client stalls, it will be noticable on the
desktop (probably as Konqueror or whatever is accessing configuration or
cache files).

With 2.6.11.6 the client is virtually unusable when large files are
transferred.  A df -h will hang on the mounted filesystem for several
seconds, and I have my mom on the phone complaining that various windows
won't close and that her machine is too slow (*again* it's no more than
half a year ago she got the new P4)  ;)

Now there's plenty of things to start optimizing; RPC over TCP, using a
switch or crossover cable instead of the hub, etc. etc.

However, what changed here was the client kernel going from an earlier
2.6 to 2.6.11.6.

A lot happened to the NFS client in 2.6.11 - I wonder if there's any of
these patches that are worth trying to revert?  I have several setups
that suck currently, so I'm willing to try a thing or two :)

I would try 
---
[EMAIL PROTECTED]
RPC: Convert rpciod into a work queue for greater flexibility.
Signed-off-by: Trond Myklebust [EMAIL PROTECTED]
---
if no one has a better idea...  But that's just a hunch based solely on
my observation of rpciod being a CPU hog on one of the earlier client
systems.  I didn't observe large 'sy' times in vmstat on this client
while it hung on NFS though...

Any suggestions would be greatly appreciated,

-- 

 / jakob



Re: 10 GB in Opteron machine

2005-07-22 Thread Jakob Oestergaard
On Fri, Jul 22, 2005 at 11:31:38AM +0200, Christoph Pleger wrote:
 Hello,
...
  There is no highmem option for the 64-bit kernel, because it doesn't 
  need one.
 
 I have two questions:
 
 1. Is it possible to compile a 64-bit kernel on a 32-bit machine (or at
 least on a 64-bit machine with 32-bit software) and if yes, how can I do
 that?

Yes. On Debian Sarge, I have a few wrapper scripts to accomplish it -
all attached to this mail - just untar them in /usr/local/bin on a
standard x86 32-bit Sarge distro.  Use 'kmake' instead of 'make' when
you are working with your kernel source (eg. 'kmake menuconfig', 'kmake
all')

Sarge comes with all the necessary toolchain support to build a 64-bit
kernel.

It should be equally possible on most other distros of course, I just
haven't felt the urge to go waste my time with them :)
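The attached scripts were not reproduced inline, but the core of such a wrapper is small. A purely hypothetical sketch - the real kmake.tar attached to this mail may differ, and the cross-compiler prefix here is an assumption (on a biarch toolchain, CC="gcc -m64" may be what's needed instead):

```sh
#!/bin/sh
# kmake - hypothetical sketch of a wrapper that builds an x86_64
# kernel on a 32-bit userland by forcing the target architecture.
# CROSS_COMPILE prefix is an assumption about the installed toolchain.
exec make ARCH=x86_64 CROSS_COMPILE=x86_64-linux- "$@"
```

Invoked as 'kmake menuconfig' or 'kmake all', every make variable override is applied consistently across the whole build.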

 2. All other software on the machine is 32-bit software. Will that
 software work with a 64-bit kernel?

Yes. You tell your 64-bit kernel to enable 'IA32 Emulation' (under
Executable file formats / Emulations).

This is really the clever way to run a 64-bit system - 99% of what is
commonly run on most systems only gains overhead from the 64-bit address
space - tools like postfix, cron, syslog, apache, ... will not gain from
being native 64-bit.

The kernel however will gain from being 64-bit - and it will easily run
your existing 32-bit apps.
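Concretely, the relevant kernel configuration option (as named in the 2.6-era x86_64 Kconfig) is:

```
# .config fragment - lets a 64-bit kernel execute 32-bit ELF binaries
CONFIG_IA32_EMULATION=y
```
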

Solaris has done this for ages - maintaining a mostly 32-bit user space,
a 64-bit kernel, and then allowing for certain memory intensive
applications to run natively 64-bit.

It's a nice way to run a Linux based system too, IMO.

-- 

 / jakob



kmake.tar
Description: Unix tar archive


Re: 10 GB in Opteron machine

2005-07-22 Thread Jakob Oestergaard
On Fri, Jul 22, 2005 at 01:37:46PM +0200, Christoph Pleger wrote:
 Hello,
 
...
 I am also using Debian sarge. I extracted the tarfile to /usr/local/bin
 end executed kmake menuconfig. Everything seemed fine so far. But a
 few seconds after starting the compilation (kmake bzImage) I got this
 error message:
 
 In file included from snip
 ...
 snip
 include/asm/mpspec.h:6:25: mach_mpspec.h: No such file or directory

Try a plain 2.6.11.11

 Hm. I understand why that file cannot be found: It only exists in the
 asm-i386 directory. But why does the compilation process look for a file
 that belongs to i386, but not to x86_64?

Kernel source screwed up?  :)

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-20 Thread Jakob Oestergaard
On Tue, Apr 19, 2005 at 06:46:28PM -0400, Trond Myklebust wrote:
 ty den 19.04.2005 Klokka 21:45 (+0200) skreiv Jakob Oestergaard:
 
  It mounts a home directory from a 2.6.6 NFS server - the client and
  server are on a hub'ed 100Mbit network.
  
  On the earlier 2.6 client I/O performance was as one would expect on
  hub'ed 100Mbit - meaning, not exactly stellar, but you'd get around 4-5
  MB/sec and decent interactivity.
 
 OK, hold it right there...
 
...
 Also, does that hub support NICs that do autonegotiation? (I'll bet the
 answer is no).

*blush*

Ok Trond, you got me there - I don't know why upgrading the client made
the problem much more visible though, but the *server* had negotiated
full duplex rather than half (the client negotiated half ok). Fixing
that on the server side made the client pleasant to work with again.
Mom's a happy camper now again  ;)

Sorry for jumping the gun there...

To get back to the original problem;

I wonder if (as was discussed) the tg3 driver on my NFS server is
dropping packets, causing the 2.6.11 NFS client to misbehave... This
didn't make sense to me before (since earlier clients worked well), but
having just seen this other case where a broken server setup caused
2.6.11 clients to misbehave (where earlier clients were fine), maybe it
could be an explanation...

Will try either changing tg3 driver or putting in an e1000 on my NFS
server - I will let you know about the status on this when I know more.

Thanks all,

-- 

 / jakob



Re: PROBLEM: please remove reserved word new from kernel headers

2005-07-06 Thread Jakob Oestergaard
On Wed, Jul 06, 2005 at 02:26:57AM -0700, Rob Prowel wrote:
 [1.] One line summary of the problem:
 
 2.4 and 2.6 kernel headers use c++ reserved word new
 as identifier in function prototypes.

Correction:

[1.] One line summary of problem:

Userspace application is making use of private kernel headers.

 
 [2.] Full description of the problem/report:
 
 When kernel headers are included in compilation of c++
 programs the compile fails because some header files
 use new in a way that is illegal for c++.  This
 shows up when compiling mySQL under linux 2.6.  It
 uses $KERNELSOURCE/include/asm-i386/system.h.

Corrected:

[2.] Full description of the problem/report:

When userspace applications include headers they shouldn't, all kinds of
problems can appear. One example of this shows up when compiling mySQL
under linux 2.6.  It uses $KERNELSOURCE/include/asm-i386/system.h.

...
 While not an error, per se, it is kind of sloppy and
 it is amazing that it hasn't shown up before now. 

It has shown up, and it has been discussed. Search the archives. I'm
pretty sure the exact problem you're reporting was discussed here a few
months back.

 using the identifier new in kernel headers that are
 visible to applications programs is a bad idea.

No one's doing that. Because the headers aren't meant to be visible.

This is not a C vs. C++ problem and it has nothing to do with
'sloppiness'. Something much subtler could have happened had it been C
application code which would have parsed cleanly but just broken in
strange ways (due to assumptions in kernel header code which just happen
to not be met in the userspace code). Actually, you should be grateful
someone used 'new' - at least now the error is caught at compile time ;)

It should be simple to fix MySQL to keep its dirty little fingers off
the kernel's private parts...

Really, that is the solution.

Take a look at *what* MySQL is doing with the header. If it is the same
problem (I have not double checked, but I guess chances are good) as was
reported earlier, it's really just a small braindamage in MySQL which is
easily fixed (thus removing the need for inclusion of the header in
question).

-- 

 / jakob



Re: XFS: inode with st_mode == 0

2005-01-16 Thread Jakob Oestergaard
On Sat, Jan 15, 2005 at 01:09:08PM +1100, Nathan Scott wrote:
...
  AFAIK the best you can do is to get the most recent XFS kernel from
  SGI's CVS (this one is based on 2.6.10).
 
 The -mm tree also has these fixes; we'll get them merged into
 mainline soon.

Okeydokey - good

 
  If you run that kernel, then most of the former problems will be gone;
  *) I only have one undeletable directory on my system - so it seems that
  this error is no longer common   ;)
 
 You may need to run xfs_repair to clean that up..?  Or does
 the problem persist after a repair?

I'm running Debian Woody - the xfs_check/xfs_repair there didn't seem to
find anything last I tried. I have not re-checked for this last problem
though.

I figured I might need to run the CVS version of xfs tools, and, well,
me being busy and all, I thought I'd just leave the 'delete_me'
directory hanging until some time I got more time on my hands  ;)

-- 

 / jakob



Re: XFS: inode with st_mode == 0

2005-01-17 Thread Jakob Oestergaard
On Sun, Jan 16, 2005 at 01:51:12PM +, Christoph Hellwig wrote:
 On Fri, Jan 14, 2005 at 07:23:09PM +0100, Jakob Oestergaard wrote:
  So apart from the general well known instability problems that will
  occur when you actually start *using* the system, there should be no
 
 What known instabilities?

Where should I begin?  ;)

Most of the following have already been posted to LKML - primarily by
Anders ([EMAIL PROTECTED]) - it seems that no one cares, but I'll repost a
summary that Anders sent me below:

---
Scenario 1: Mailservers:
  2.6.10 (~24-40 hours uptime):
  Running ext3 on mailqueue:

SNIP
Unable to handle kernel NULL pointer dereference at virtual address 0004
printing eip:
c018a095
*pde = 
Oops: 0002 [#1]
SMP
Modules linked in: nfs e1000 iptable_nat ipt_connlimit rtc
CPU:2
EIP:0060:[c018a095]Not tainted
EFLAGS: 00010286   (2.6.8.1)
EIP is at journal_commit_transaction+0x535/0x10e5
eax: cac1e26c   ebx:    ecx: f7cec400   edx: f7cec400
esi: f65f3000   edi: cac1e26c   ebp: f65f3000   esp: f65f3dc0
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 174, threadinfo=f65f3000 task=c2308b70)
Stack: f65f3e64      f7cec400 cda565fc
   149a 0004 f65f3e48 c01132d8 0002 c202ad20 0001 f65f3e5c
   c202ad20 c202ad20 0002 0001 001e 01c1af60 f65f3e68 c0407dc0
Call Trace:
 [c01132d8] scheduler_tick+0x468/0x470
 [c01127b5] find_busiest_group+0x105/0x310
 [c011db8e] del_timer_sync+0x7e/0xa0
 [c018cd4d] kjournald+0xbd/0x230
 [c0114b10] autoremove_wake_function+0x0/0x40
 [c0114b10] autoremove_wake_function+0x0/0x40
 [c0103f16] ret_from_fork+0x6/0x14
 [c018cc70] commit_timeout+0x0/0x10
 [c018cc90] kjournald+0x0/0x230
 [c01024bd] kernel_thread_helper+0x5/0x18
Code: f0 ff 43 04 8b 03 83 e0 04 74 4c 8b 8c 24 b8 01 00 00 c6 81
 2SoftDog: Initiating system reboot
/SNIP

---
Scenario 2: Mailservers:
  Running XFS on mailqueue:

SNIP
Filesystem sdb1: xfs_trans_delete_ail: attempting to delete a log item that 
is not in the AIL
xfs_force_shutdown(sdb1,0x8) called from line 382 of file 
fs/xfs/xfs_trans_ail.c.  Return address = 0xc0216a56
@Linux version 2.6.9 ([EMAIL PROTECTED]) (gcc version 2.96 2731 (Red 
Hat Linux 7.3 2.96-113)) #1 SMP Tue Oct 19 16:04:55 CEST 2004
/SNIP


===
Resolution to the mailserver problem:
 2.4.28 is perfectly stable on these machines.

---
Scenario 3: Webservers:

  2.6.10, 2.6.10-ac8 (~3-12 hours uptime):

SNIP
Unable to handle kernel paging request
2SoftDog: Initiating system reboot.
SNIP
(No more...) :(

===
Resolution to the webserver problem:
 2.4.28/2.4.29-rc2 are stable here

---
Scenario 4: Storageservers: 
  2.6.8.1:
Oopses after ~5-10 hours whith SMP on. - Cannot find the actual Oopses 
anymore and 2.6.8+ havent been tested as we cannot afford anymore downtime on 
these servers.


===
Resolution to the storage server problem:
 2.6.8.1 UP is stable (but oopses regularly after memory allocation
 failures)



Hardware on all servers: IBM x335 and x345.

Mentioned errors seen on a total of 17 servers.

-- 

 / jakob



Re: raid 1 - automatic 'repair' possible?

2005-01-19 Thread Jakob Oestergaard
On Wed, Jan 19, 2005 at 11:48:52AM +0100, Kiniger wrote:
...
 some random thoughts:
 
 nowadays hardware sector sizes are much bigger than 512 bytes

No :)

 and
 the read error may affect some sectors +- the sector which actually
 returned the error.

That's right

 
 to keep the handling in userspace as much as possible: 
 
 the real problem is the long resync time. therefore it would
 be sufficient to have a concept of defective areas per partition
 and drive (a few of them, perhaps four or so , would be enough) 
 which will be excluded from reads/writes and some means to
 re-synchronize these defective areas from the good counterparts
 of the other disk. This would avoid having the whole partition being
 marked as defective.

I wonder if it's really worth it.

The original idea has some merit I think - but what you're suggesting
here is almost bad block remapping with transparent recovery and user
space policy agents etc. etc.

If a drive has problems reading the platter, it can usually be corrected
by overwriting the given sector (either the drive can actually overwrite
the sector in place, or it will re-allocate it with severe read
performance penalties following). But there's a reason why that sector
went bad, and you really want to get the disk replaced.
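For reference, forcing the rewrite described above is commonly done with a raw dd against the failing sector. Everything below is a placeholder, and the write is destructive to that sector's contents:

```sh
# DESTRUCTIVE: overwrites one 512-byte sector with zeros so the drive
# can rewrite it in place or reallocate it from its spare pool.
# /dev/sdX and 123456 are placeholders; take the failing LBA from the
# kernel log or from a SMART self-test report.
dd if=/dev/zero of=/dev/sdX bs=512 seek=123456 count=1
```
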

I think the current policy of marking the disk as failed when it has
failed is sensible.

Just my 0.02 Euro

-- 

 / jakob



Re: Getting rid of SHMMAX/SHMALL ?

2005-08-04 Thread Jakob Oestergaard
On Thu, Aug 04, 2005 at 02:19:21PM +0100, Hugh Dickins wrote:
...
  Even on 32bit architectures it is far too small and doesn't
  make much sense. Does anybody remember why we even have this limit?
 
 To be like the UNIXes.

 :)

...
 Anton proposed raising the limits last autumn, but I was a bit
 discouraging back then, having noticed that even Solaris 9 was more
 restrictive than Linux.  They seem to be ancient traditional limits
 which everyone knows must be raised to get real work done.

As I understand it (and I may be mistaken - if so please let me know) -
the limit is for SVR4 IPC shared memory (shmget() and friends), and not
shared memory in general.

It makes good sense to limit use of the old SVR4 shared memory
resources, as they're generally administrator hell (resources aren't
freed up on process exit), and they just plain shouldn't be used.

It is my impression that SVR4 shmem is used in very few applications,
and that the low limit is more than sufficient in most cases.

Any proper application that really needs shared memory can either
memory map /dev/zero and share that map (swap-backed shared memory) or
memory map a file on disk.
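As a concrete sketch of that alternative - in its modern spelling, an anonymous MAP_SHARED mapping, which is equivalent to mapping /dev/zero shared. All names here are illustrative:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one page of anonymous shared memory, fork, let the child write a
 * value through the shared mapping, and return what the parent sees.
 * MAP_ANONYMOUS | MAP_SHARED is the modern equivalent of mapping
 * /dev/zero with MAP_SHARED. */
int share_counter(void)
{
    int *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *p = 0;
    pid_t pid = fork();
    if (pid == 0) {
        *p = 42;        /* child's store is visible to the parent */
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    int v = *p;
    munmap(p, sizeof(int));
    return v;
}
```

Unlike a shmget() segment, the mapping disappears with the processes using it - no stale IPC segments for the administrator to hunt down with ipcrm.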

If the above makes sense and isn't too far from the truth, then I guess
that's a pretty good argument for maintaining status quo.

-- 

 / jakob



Re: Why is the kfree() argument const?

2008-01-18 Thread Jakob Oestergaard
On Thu, Jan 17, 2008 at 01:25:39PM -0800, Linus Torvalds wrote:
...
 Why do you make that mistake, when it is PROVABLY NOT TRUE!
 
 Try this trivial program:
 
   int main(int argc, char **argv)
   {
   int i;
   const int *c;
   
   i = 5;
    c = &i;
   i = 10;
   return *c;
   }
 
 and realize that according to the C rules, if it returns anything but 10, 
 the compiler is *buggy*.

That's not how this works (as we obviously agree).

Please consider a rewrite of your example, demonstrating the usefulness and
proper application of const pointers:

extern void foo(const int *);

int main(int argc, char **argv)
{
 int i;

 i = 5;
 foo(&i);
 return i;
}

Now, if the program returns anything other than 5, it means someone cast away
const, which is generally considered a bad idea in most other software
projects, for this very reason.

*That* is the purpose of const pointers.

Besides, for most debugging-enabled free() implementations, free() does indeed
touch the memory pointed to by its argument, which makes giving it a const
pointer completely bogus except for a single potential optimized special-case
where it might actually not touch the memory.
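For readers following along, the distinction both sides rely on fits in a few lines. Names are illustrative: const qualifies the access path, not the object.

```c
#include <assert.h>

/* 'c' promises that this function will not write through this
 * particular pointer - it does not freeze the object itself, which
 * may still change through a non-const alias. */
int observe(const int *c, int *alias)
{
    *alias = 10;    /* legal: the alias is not const-qualified */
    return *c;      /* reads 10: const constrains this pointer only */
}
```

With i initialized to 5, observe(&i, &i) returns 10 - which is why a const parameter documents intent rather than guaranteeing immutability, and why casting it away inside an implementation defeats exactly that documentation.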

-- 

 / jakob



Re: Why is the kfree() argument const?

2008-01-18 Thread Jakob Oestergaard
On Fri, Jan 18, 2008 at 02:31:16PM +0100, Björn Steinbrink wrote:
...
 
 Do you see anything that casts the const away? No? Me neither. Still,
 the memory that p points to was changed, because there was another
 pointer and that was not const.

*another* being key here.

 
  *That* is the purpose of const pointers.
 
 The only thing that const can tell you is that you should not modify the
 value _yourself_, using that pointer _directly_.

Which is pretty damn useful.

Think about it.  Don't you ever use const?  Is it ever only in the way?

...
{snip long explanation about how one can avoid the benefits of const, without
using casts}
...
 If you want to restrict the set of pointers that can be invalidated by
 an other pointer, you'll have to use something else because const does
 not talk about invalidating aliasing pointers.

Precisely, so why are we discussing this?

I claim that const is useful. You claim that it can't solve all the worlds
problems. I agree with that, but I maintain it is still useful.

But for it to be useful, people must not circumvent it in the wrong
places (such as kfree).

-- 

 / jakob



Re: Why is the kfree() argument const?

2008-01-18 Thread Jakob Oestergaard
On Fri, Jan 18, 2008 at 12:47:01PM +0100, Giacomo A. Catenazzi wrote:
...
 restrict exists for this reason. const is only about lvalue.

You think that I try to put more meaning into const than I do - but I don't.

Please read what I wrote, not what you want to think I wrote.

I agree that if I said what you seem to imply I said, then I would have been
wrong. But I didn't, so I'm not ;)


-- 

 / jakob



bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-06 Thread Jakob Oestergaard

Hello list,

Setup; 
 NFS server (dual opteron, HW RAID, SCA disk enclosure) on 2.6.11.6
 NFS client (dual PIII) on 2.6.11.6

Both on switched gigabit ethernet - I use NFSv3 over UDP (tried TCP but
this makes no difference).

Problem; during simple tests such as a 'cp largefile0 largefile1' on the
client (under the mountpoint from the NFS server), the client becomes
extremely laggy, NFS writes are slow, and I see very high CPU
utilization by bdflush and rpciod.

For example, writing a single 8G file with dd will give me about
20MB/sec (I get 60+ MB/sec locally on the server), and the client rarely
drops below 40% system CPU utilization.

I tried profiling the client (booting with profile=2), but the profile
traces do not make sense; a profile from a single write test where the
client did not at any time drop below 30% system time (and frequently
were at 40-50%) gives me something like:

raven:~# less profile3 | sort -nr | head
257922 total  2.6394
254739 default_idle 5789.5227
   960 smp_call_function  4.
   888 __might_sleep  5.6923
   569 finish_task_switch 4.7417
   176 kmap_atomic1.7600
   113 __wake_up  1.8833
74 kmap   1.5417
64 kunmap_atomic  5.

The difference between default_idle and total is 1.2% - but I never saw
system CPU utilization under 30%...

Besides, there's basically nothing in the profile that rhymes with
rpciod or bdflush (the two high-hitters on top during the test).

What do I do?

Performance sucks and the profiles do not make sense...

Any suggestions would be greatly appreciated,

Thank you!

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-07 Thread Jakob Oestergaard
On Wed, Apr 06, 2005 at 05:28:56PM -0400, Trond Myklebust wrote:
...
> A look at "nfsstat" might help, as might "netstat -s".
> 
> In particular, I suggest looking at the "retrans" counter in nfsstat.

When doing a 'cp largefile1 largefile2' on the client, I see approx. 10
retransmissions per second in nfsstat.

I don't really know if this is a lot...

I also see packets dropped in ifconfig - approx. 10 per second...  I
wonder if these two are related.

Client has an intel e1000 card - I just set the RX ring buffer to the
max. of 4096 (up from the default of 256), but this doesn't seem to help
a lot (I see the 10 drops/sec with the large RX buffer).

I use NAPI - is there anything else I can do to make the card not drop
packets?   I'm just assuming that this might at least be a part of the
problem, but with large RX ring and NAPI I don't know how much else I
can do to not make the box drop incoming data...
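For reference, the RX ring resize mentioned above is the ethtool -G knob (device name illustrative; not every driver honors it):

```sh
ethtool -g eth0          # show current and maximum RX/TX ring sizes
ethtool -G eth0 rx 4096  # grow the RX ring, here to the e1000 maximum
```
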

> When you say that TCP did not help, please note that if retrans is high,
> then using TCP with a large value for timeo (for instance -otimeo=600)
> is a good idea. It is IMHO a bug for the "mount" program to be setting
> default timeout values of less than 30 seconds when using TCP.

I can try that.

Thanks!

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-07 Thread Jakob Oestergaard
On Thu, Apr 07, 2005 at 09:19:06AM +1000, Greg Banks wrote:
...
> How large is the client's RAM? 

2GB - (32-bit kernel because it's a dual PIII, so I use highmem)

A few more details:

With standard VM settings, the client will be laggy during the copy, but
it will also have a load average around 10 (!)   And really, the only
thing I do with it is one single 'cp' operation.  The CPU hogs are
pdflush, rpciod/0 and rpciod/1.

I tweaked the VM a bit, put the following in /etc/sysctl.conf:
 vm.dirty_writeback_centisecs=100
 vm.dirty_expire_centisecs=200

The defaults are 500 and 3000 respectively...

This improved things a lot; the client is now "almost not very laggy",
and load stays in the saner 1-2 range.

Still, system CPU utilization is very high (still from rpciod and
pdflush - more rpciod and less pdflush though), and the file copying
performance over NFS is roughly half of what I get locally on the server
(8G file copy with 16MB/sec over NFS versus 32 MB/sec locally).

(I run with plenty of knfsd threads on the server, and generally the
server is not very loaded when the client is pounding it as much as it
can)

> What does the following command report
> before and during the write?
> 
> egrep 'nfs_page|nfs_write_data' /proc/slabinfo

During the copy I typically see:

nfs_write_data  681   952 480  8 1 : tunables  54 27 8 : slabdata 119 119 108
nfs_page  15639 18300  64 61 1 : tunables 120 60 8 : slabdata 300 300 180

The "18300" above typically goes from 12000 to 25000...

After the copy I see:

nfs_write_data  36  48 480  8 1 : tunables   54   27 8 : slabdata  5  6  0
nfs_page 1  61  64 61 1 : tunables  120   60 8 : slabdata  1  1  0

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-09 Thread Jakob Oestergaard
On Thu, Apr 07, 2005 at 12:17:51PM -0400, Trond Myklebust wrote:
> to den 07.04.2005 Klokka 17:38 (+0200) skreiv Jakob Oestergaard:
> 
> > I tweaked the VM a bit, put the following in /etc/sysctl.conf:
> >  vm.dirty_writeback_centisecs=100
> >  vm.dirty_expire_centisecs=200
> > 
> > The defaults are 500 and 3000 respectively...
> > 
> > This improved things a lot; the client is now "almost not very laggy",
> > and load stays in the saner 1-2 range.
> 
> OK. That hints at what is causing the latencies on the server: I'll bet
> it is the fact that the page reclaim code tries to be clever, and uses
> NFSv3 STABLE writes in order to be able to free up the dirty pages
> immediately. Could you try the following patch, and see if that makes a
> difference too?

The patch alone without the tweaked VM settings doesn't cure the lag - I
think it's better than without the patch (I can actually type this mail
with a large copy running). With the tweaked VM settings too, it's
pretty good - still a little lag, but not enough to really make it
annoying.

Performance is pretty much the same as before (copying an 8GiB file with
15-16MiB/sec - about half the performance of what I get locally on the
file server).

Something that worries me: it seems that 2.4.25 is a lot faster as an NFS
client than 2.6.11.6, most notably on writes - see the following
tiobench results (2000 MiB file, tests with 1, 2 and 4 threads) up
against the same NFS server:

2.4.25:  (dual athlon MP 1.4GHz, 1G RAM, Intel e1000)

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  58.87 54.9% 5.615 5.03% 44.40 44.2% 4.534 8.41%
   . 2000   40962  56.98 58.3% 6.926 6.64% 41.61 58.0% 4.462 10.8%
   . 2000   40964  53.90 59.0% 7.764 9.44% 39.85 61.5% 4.256 10.8%


2.6.11.6: (dual PIII 1GHz, 2G RAM, Intel e1000)

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  38.34 18.8% 19.61 6.77% 22.53 23.4% 6.947 15.6%
   . 2000   40962  52.82 29.0% 24.42 9.37% 24.20 27.0% 7.755 16.7%
   . 2000   40964  62.48 34.8% 33.65 17.0% 24.73 27.6% 8.027 15.4%


44MiB/sec for 2.4 versus 22MiB/sec for 2.6 - any suggestions as to how
this could be improved?

(note; the write performance doesn't change notably with VM tuning nor
with the one-liner change that Trond suggested)

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Sat, Apr 09, 2005 at 05:52:32PM -0400, Trond Myklebust wrote:
> lau den 09.04.2005 Klokka 23:35 (+0200) skreiv Jakob Oestergaard:
> 
> > 2.6.11.6: (dual PIII 1GHz, 2G RAM, Intel e1000)
> > 
> >  File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
> >   DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> > --- -- --- --- --- --- --- ---
> >. 2000   40961  38.34 18.8% 19.61 6.77% 22.53 23.4% 6.947 15.6%
> >. 2000   40962  52.82 29.0% 24.42 9.37% 24.20 27.0% 7.755 16.7%
> >. 2000   40964  62.48 34.8% 33.65 17.0% 24.73 27.6% 8.027 15.4%
> > 
> > 
> > 44MiB/sec for 2.4 versus 22MiB/sec for 2.6 - any suggestions as to how
> > this could be improved?
> 
> What happened to the retransmission rates when you changed to TCP?

TCP with timeo=600 causes retransmits (as seen with nfsstat) to drop to
zero.

> 
> Note that on TCP (besides bumping the value for timeo) I would also
> recommend using a full 32k r/wsize instead of 4k (if the network is of
> decent quality, I'd recommend 32k for UDP too).

32k seems to be default for both UDP and TCP.
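Taken together, the transport and size settings under discussion (TCP, timeo=600, 32k rsize/wsize) end up on the mount command line; a sketch that assembles them (option names per nfs(5)):

```python
def nfs_mount_options(proto="tcp", timeo=600, rsize=32768, wsize=32768):
    """Build the NFS mount option string from the values discussed above."""
    return "proto=%s,timeo=%d,rsize=%d,wsize=%d" % (proto, timeo, rsize, wsize)

# Typical use on the client:
#   mount -o proto=tcp,timeo=600,rsize=32768,wsize=32768 server:/export /mnt
```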

The network should be of decent quality - e1000 on client, tg3 on
server, both with short cables into a gigabit switch with plenty of
backplane headroom.

> The other tweak you can apply for TCP is to bump the value
> for /proc/sys/sunrpc/tcp_slot_table_entries. That will allow you to have
> several more RPC requests in flight (although that will also tie up more
> threads on the server).

Changing only to TCP gives me:

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  47.04 65.2% 50.57 26.2% 24.24 29.7% 6.896 28.7%
   . 2000   40962  55.77 66.1% 61.72 31.9% 24.13 33.0% 7.646 26.6%
   . 2000   40964  61.94 68.9% 70.52 42.5% 25.65 35.6% 8.042 26.7%

Pretty much the same as before - with writes being suspiciously slow
(compared to good ole' 2.4.25)

With tcp_slot_table_entries bumped to 64 on the client (128 knfsd
threads on the server, same as in all tests), I see:

 File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
  DirSize   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
--- -- --- --- --- --- --- ---
   . 2000   40961  60.50 67.6% 30.12 14.4% 22.54 30.1% 7.075 27.8%
   . 2000   40962  59.87 69.0% 34.34 19.0% 24.09 35.2% 7.805 30.0%
   . 2000   40964  62.27 69.8% 44.87 29.9% 23.07 34.3% 8.239 30.9%

So, reads start off better, it seems, but writes are still half speed of
2.4.25.
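For the record, the slot-table bump above is just a write to a /proc file; a hedged sketch (the real path requires root, so it is a parameter here):

```python
def set_rpc_slot_entries(n, path="/proc/sys/sunrpc/tcp_slot_table_entries"):
    """Set the number of concurrent in-flight RPC requests and read the value back."""
    with open(path, "w") as f:
        f.write(str(n))
    with open(path) as f:
        return int(f.read())
```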

I should say that it is common to see a single rpciod thread hogging
100% CPU for 20-30 seconds - that looks suspicious to me, other than
that, I can't really point my finger at anything in this setup.

Any suggestions Trond?   I'd be happy to run some tests for you if you
have any idea how we can speed up those writes (or reads for that
matter, although I am fairly happy with those).


-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Mon, Apr 11, 2005 at 08:35:39AM -0400, Trond Myklebust wrote:
...
> That certainly shouldn't be the case (and isn't on any of my setups). Is
> the behaviour identical same on both the PIII and the Opteron systems?

The dual Opteron is the NFS server.

The dual Athlon is the 2.4 NFS client.

The dual PIII is the 2.6 NFS client.

> As for the WRITE rates, could you send me a short tcpdump from the
> "sequential write" section of the above test? Just use "tcpdump -s 9
> -w binary.dmp"  just for a couple of seconds. I'd like to check the
> latencies, and just check that you are indeed sending unstable writes
> with not too many commit or getattr calls.

Certainly;

http://unthought.net/binary.dmp.bz2

I got an 'invalid snaplen' with the 9 you suggested; the above dump
was done with 9000 - if you need another snaplen please just let me know.

A little explanation for the IPs you see;
 sparrow/10.0.1.20 - nfs server
 raven/10.0.1.7 - 2.6 nfs client
 osprey/10.0.1.13 - NIS/DNS server

Thanks,

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Mon, Apr 11, 2005 at 10:35:25AM -0400, Trond Myklebust wrote:
> må den 11.04.2005 Klokka 15:47 (+0200) skreiv Jakob Oestergaard:
> 
> > Certainly;
> > 
> > http://unthought.net/binary.dmp.bz2
> > 
> > I got an 'invalid snaplen' with the 9 you suggested, the above dump
> > is done with 9000 - if you need another snaplen please just let me know.
> 
> So, the RPC itself looks good, but it also looks as if after a while you
> are running into some heavy retransmission problems with TCP too (at the
> TCP level now, instead of at the RPC level). When you get into that
> mode, it looks as if every 2nd or 3rd TCP segment being sent from the
> client is being lost...

Odd...

I'm really sorry for using your time if this ends up being just a
networking problem.

> That can mean either that the server is dropping fragments, or that the
> client is dropping the replies. Can you generate a similar tcpdump on
> the server?

Certainly;  http://unthought.net/sparrow.dmp.bz2


-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-11 Thread Jakob Oestergaard
On Mon, Apr 11, 2005 at 11:21:45AM -0400, Trond Myklebust wrote:
> må den 11.04.2005 Klokka 16:41 (+0200) skreiv Jakob Oestergaard:
> 
> > > That can mean either that the server is dropping fragments, or that the
> > > client is dropping the replies. Can you generate a similar tcpdump on
> > > the server?
> > 
> > Certainly;  http://unthought.net/sparrow.dmp.bz2
> 
> So, it looks to me as if "sparrow" is indeed dropping packets (missed
> sequences). Is it running with NAPI enabled too?

Yes, as far as I know - the Broadcom Tigon3 driver does not have the
option of enabling/disabling RX polling (if we agree that is what we're
talking about), but looking in tg3.c it seems that it *always*
unconditionally uses NAPI...

sparrow:~# ifconfig
eth0  Link encap:Ethernet  HWaddr 00:09:3D:10:BB:1E
  inet addr:10.0.1.20  Bcast:10.0.1.255  Mask:255.255.255.0
  inet6 addr: fe80::209:3dff:fe10:bb1e/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:2304578 errors:0 dropped:0 overruns:0 frame:0
  TX packets:2330829 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:2381644307 (2.2 GiB)  TX bytes:2191756317 (2.0 GiB)
  Interrupt:169

No dropped packets... I wonder if the tg3 driver is being completely
honest about this...

Still, 2.4 manages to perform twice as fast against the same server.

And, the 2.6 client still has extremely heavy CPU usage (from rpciod
mainly, which doesn't show up in profiles)

The plot thickens...

Trond (or anyone else feeling they might have some insight they would
like to share on this one), I'll do anything you say (ok, *almost*
anything you say) - any ideas?


-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-12 Thread Jakob Oestergaard
On Tue, Apr 12, 2005 at 11:03:29AM +1000, Greg Banks wrote:
> On Tue, 2005-04-12 at 01:42, Jakob Oestergaard wrote:
> > Yes, as far as I know - the Broadcom Tigon3 driver does not have the
> > option of enabling/disabling RX polling (if we agree that is what we're
> > talking about), but looking in tg3.c it seems that it *always*
> > unconditionally uses NAPI...
> 
> I've whined and moaned about this in the past, but for all its
> faults NAPI on tg3 doesn't lose packets.  It does cause a huge
> increase in irq cpu time on multiple fast CPUs.  What irq rate
> are you seeing?

Around 20,000 interrupts per second during the large write, on the IRQ
where eth0 is (this is not shared with anything else).

[sparrow:joe] $ cat /proc/interrupts
   CPU0   CPU1
...
169:3853488  412570512   IO-APIC-level  eth0
...
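The rate quoted above comes from differencing two /proc/interrupts samples of the same IRQ line over a known interval; trivially:

```python
def irq_rate(prev_count, curr_count, interval_secs):
    """Interrupts per second for one IRQ line between two /proc/interrupts samples."""
    return (curr_count - prev_count) / float(interval_secs)
```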


But still, guys, it is the *same* server with tg3 that runs well with a
2.4 client but poorly with a 2.6 client.

Maybe I'm just staring myself blind at this, but I can't see how a
general problem on the server (such as packet loss, latency or whatever)
would cause no problems with a 2.4 client but major problems with a 2.6
client.

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-19 Thread Jakob Oestergaard
On Tue, Apr 12, 2005 at 11:28:43AM +0200, Jakob Oestergaard wrote:
...
> 
> But still, guys, it is the *same* server with tg3 that runs well with a
> 2.4 client but poorly with a 2.6 client.
> 
> Maybe I'm just staring myself blind at this, but I can't see how a
> general problem on the server (such as packet loss, latency or whatever)
> would cause no problems with a 2.4 client but major problems with a 2.6
> client.

Another data point;

I upgraded my mom's machine from an earlier 2.6 (don't remember which,
but I can find out) to 2.6.11.6.

It mounts a home directory from a 2.6.6 NFS server - the client and
server are on a hub'ed 100Mbit network.

On the earlier 2.6 client I/O performance was as one would expect on
hub'ed 100Mbit - meaning, not exactly stellar, but you'd get around 4-5
MB/sec and decent interactivity.

The typical workload here is storing or retrieving large TIFF files on
the client, while working with other things in KDE. So, if the
large-file NFS I/O causes NFS client stalls, it will be noticeable on the
desktop (probably as Konqueror or whatever is accessing configuration or
cache files).

With 2.6.11.6 the client is virtually unusable when large files are
transferred.  A "df -h" will hang on the mounted filesystem for several
seconds, and I have my mom on the phone complaining that various windows
won't close and that her machine is too slow (*again* it's no more than
half a year ago she got the new P4)  ;)

Now there's plenty of things to start optimizing; RPC over TCP, using a
switch or crossover cable instead of the hub, etc. etc.

However, what changed here was the client kernel going from an earlier
2.6 to 2.6.11.6.

A lot happened to the NFS client in 2.6.11 - I wonder if there's any of
these patches that are worth trying to revert?  I have several setups
that suck currently, so I'm willing to try a thing or two :)

I would try 
---
<[EMAIL PROTECTED]>
RPC: Convert rpciod into a work queue for greater flexibility.
Signed-off-by: Trond Myklebust <[EMAIL PROTECTED]>
---
if no one has a better idea...  But that's just a hunch based solely on
my observation of rpciod being a CPU hog on one of the earlier client
systems.  I didn't observe large 'sy' times in vmstat on this client
while it hung on NFS though...

Any suggestions would be greatly appreciated,

-- 

 / jakob



Re: bdflush/rpciod high CPU utilization, profile does not make sense

2005-04-20 Thread Jakob Oestergaard
On Tue, Apr 19, 2005 at 06:46:28PM -0400, Trond Myklebust wrote:
> ty den 19.04.2005 Klokka 21:45 (+0200) skreiv Jakob Oestergaard:
> 
> > It mounts a home directory from a 2.6.6 NFS server - the client and
> > server are on a hub'ed 100Mbit network.
> > 
> > On the earlier 2.6 client I/O performance was as one would expect on
> > hub'ed 100Mbit - meaning, not exactly stellar, but you'd get around 4-5
> > MB/sec and decent interactivity.
> 
> OK, hold it right there...
> 
...
> Also, does that hub support NICs that do autonegotiation? (I'll bet the
> answer is "no").

*blush*

Ok Trond, you got me there - I don't know why upgrading the client made
the problem much more visible though, but the *server* had negotiated
full duplex rather than half (the client negotiated half ok). Fixing
that on the server side made the client pleasant to work with again.
Mom's a happy camper again  ;)

Sorry for jumping the gun there...

To get back to the original problem;

I wonder if (as was discussed) the tg3 driver on my NFS server is
dropping packets, causing the 2.6.11 NFS client to misbehave... This
didn't make sense to me before (since earlier clients worked well), but
having just seen this other case where a broken server setup caused
2.6.11 clients to misbehave (where earlier clients were fine), maybe it
could be an explanation...

Will try either changing tg3 driver or putting in an e1000 on my NFS
server - I will let you know about the status on this when I know more.

Thanks all,

-- 

 / jakob



Re: PROBLEM: please remove reserved word "new" from kernel headers

2005-07-06 Thread Jakob Oestergaard
On Wed, Jul 06, 2005 at 02:26:57AM -0700, Rob Prowel wrote:
> [1.] One line summary of the problem:
> 
> 2.4 and 2.6 kernel headers use c++ reserved word "new"
> as identifier in function prototypes.

Correction:

[1.] One line summary of problem:

Userspace application is making use of private kernel headers.

> 
> [2.] Full description of the problem/report:
> 
> When kernel headers are included in compilation of c++
> programs the compile fails because some header files
> use "new" in a way that is illegal for c++.  This
> shows up when compiling mySQL under linux 2.6.  It
> uses $KERNELSOURCE/include/asm-i386/system.h.

Corrected:

[2.] Full description of the problem/report:

When userspace applications include headers they shouldn't, all kinds of
problems can appear. One example of this shows up when compiling MySQL
under Linux 2.6.  It uses $KERNELSOURCE/include/asm-i386/system.h.

...
> While not an error, per se, it is kind of sloppy and
> it is amazing that it hasn't shown up before now. 

It has shown up, and it has been discussed. Search the archives. I'm
pretty sure the exact problem you're reporting was discussed here a few
months back.

> using the identifier "new" in kernel headers that are
> visible to applications programs is a bad idea.

No one's doing that, because the headers aren't meant to be visible.

This is not a C vs. C++ problem and it has nothing to do with
'sloppiness'. Something much subtler could have happened had it been C
application code which would have parsed cleanly but just broken in
strange ways (due to assumptions in kernel header code which just happen
to not be met in the userspace code). Actually, you should be grateful
someone used 'new' - at least now the error is caught at compile time ;)

It should be simple to fix MySQL to keep its dirty little fingers off
of the kernel's private parts..

Really, that is the solution.

Take a look at *what* MySQL is doing with the header. If it is the same
problem (I have not double checked, but I guess chances are good) as was
reported earlier, it's really just a small braindamage in MySQL which is
easily fixed (thus removing the need for inclusion of the header in
question).

-- 

 / jakob



Re: Getting rid of SHMMAX/SHMALL ?

2005-08-04 Thread Jakob Oestergaard
On Thu, Aug 04, 2005 at 02:19:21PM +0100, Hugh Dickins wrote:
...
> > Even on 32bit architectures it is far too small and doesn't
> > make much sense. Does anybody remember why we even have this limit?
> 
> To be like the UNIXes.

 :)

...
> Anton proposed raising the limits last autumn, but I was a bit
> discouraging back then, having noticed that even Solaris 9 was more
> restrictive than Linux.  They seem to be ancient traditional limits
> which everyone knows must be raised to get real work done.

As I understand it (and I may be mistaken - if so please let me know) -
the limit is for SVR4 IPC shared memory (shmget() and friends), and not
shared memory in general.

It makes good sense to limit use of the old SVR4 shared memory
resources, as they're generally an administrator's hell (they aren't
freed on process exit), and they just plain shouldn't be used.

It is my impression that SVR4 shmem is used in very few applications,
and that the low limit is more than sufficient in most cases.

Any proper application that really needs shared memory can either
memory map /dev/zero and share that map (swap-backed shared memory) or
memory map a file on disk.
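As an illustration (mine, not from the original mail) of that alternative: an anonymous MAP_SHARED mapping - what mapping /dev/zero gives you - is shared across fork() and released automatically when the processes exit:

```python
import mmap
import os

def shared_counter_demo():
    """Parent and child share one page of swap-backed memory; no SysV IPC involved."""
    buf = mmap.mmap(-1, mmap.PAGESIZE)  # anonymous mapping, MAP_SHARED on Linux
    buf[0:4] = (0).to_bytes(4, "little")
    pid = os.fork()
    if pid == 0:
        # Child writes through the shared page, then exits.
        buf[0:4] = (42).to_bytes(4, "little")
        os._exit(0)
    os.waitpid(pid, 0)
    # Parent sees the child's write in the same mapping.
    return int.from_bytes(buf[0:4], "little")
```

Unlike shmget() segments, nothing lingers after exit; a file-backed mapping works the same way, with a real file descriptor instead of -1.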

If the above makes sense and isn't too far from the truth, then I guess
that's a pretty good argument for maintaining status quo.

-- 

 / jakob



Re: 10 GB in Opteron machine

2005-07-22 Thread Jakob Oestergaard
On Fri, Jul 22, 2005 at 11:31:38AM +0200, Christoph Pleger wrote:
> Hello,
...
> > There is no highmem option for the 64-bit kernel, because it doesn't 
> > need one.
> 
> I have two questions:
> 
> 1. Is it possible to compile a 64-bit kernel on a 32-bit machine (or at
> least on a 64-bit machine with 32-bit software) and if yes, how can I do
> that?

Yes. On Debian Sarge, I have a few wrapper scripts to accomplish it -
all attached to this mail - just untar them in /usr/local/bin on a
standard x86 32-bit Sarge distro.  Use 'kmake' instead of 'make' when
you are working with your kernel source (eg. 'kmake menuconfig', 'kmake
all')

Sarge comes with all the necessary toolchain support to build a 64-bit
kernel.

It should be equally possible on most other distros of course, I just
haven't felt the urge to go waste my time with them :)

> 2. All other software on the machine is 32-bit software. Will that
> software work with a 64-bit kernel?

Yes. You tell your 64-bit kernel to enable 'IA32 Emulation' (under
Executable file formats / Emulations).

This is really the clever way to run a 64-bit system - 99% of what is
commonly run on most systems only gains overhead from the 64-bit address
space - tools like postfix, cron, syslog, apache, ... will not gain from
being native 64-bit.

The kernel however will gain from being 64-bit - and it will easily run
your existing 32-bit apps.

Solaris has done this for ages - maintaining a mostly 32-bit user space,
a 64-bit kernel, and then allowing for certain memory intensive
applications to run natively 64-bit.

It's a nice way to run a Linux based system too, IMO.

-- 

 / jakob



kmake.tar
Description: Unix tar archive


Re: 10 GB in Opteron machine

2005-07-22 Thread Jakob Oestergaard
On Fri, Jul 22, 2005 at 01:37:46PM +0200, Christoph Pleger wrote:
> Hello,
> 
...
> I am also using Debian sarge. I extracted the tarfile to /usr/local/bin
> end executed "kmake menuconfig". Everything seemed fine so far. But a
> few seconds after starting the compilation (kmake bzImage) I got this
> error message:
> 
> In file included from 
> ...
> 
> include/asm/mpspec.h:6:25: mach_mpspec.h: No such file or directory

Try a plain 2.6.11.11

> Hm. I understand why that file cannot be found: It only exists in the
> asm-i386 directory. But why does the compilation process look for a file
> that belongs to i386, but not to x86_64?

Kernel source screwed up?  :)

-- 

 / jakob



Re: XFS: inode with st_mode == 0

2005-01-16 Thread Jakob Oestergaard
On Sat, Jan 15, 2005 at 01:09:08PM +1100, Nathan Scott wrote:
...
> > AFAIK the best you can do is to get the most recent XFS kernel from
> > SGI's CVS (this one is based on 2.6.10).
> 
> The -mm tree also has these fixes; we'll get them merged into
> mainline soon.

Okeydokey - good

> 
> > If you run that kernel, then most of the former problems will be gone;
> > *) I only have one undeletable directory on my system - so it seems that
> > this error is no longer common   ;)
> 
> You may need to run xfs_repair to clean that up..?  Or does
> the problem persist after a repair?

I'm running Debian Woody - the xfs_check/xfs_repair there didn't seem to
find anything last I tried. I have not re-checked for this last problem
though.

I figured I might need to run the CVS version of xfs tools, and, well,
me being busy and all, I thought I'd just leave the 'delete_me'
directory hanging until some time I got more time on my hands  ;)

-- 

 / jakob



Re: XFS: inode with st_mode == 0

2005-01-17 Thread Jakob Oestergaard
On Sun, Jan 16, 2005 at 01:51:12PM +, Christoph Hellwig wrote:
> On Fri, Jan 14, 2005 at 07:23:09PM +0100, Jakob Oestergaard wrote:
> > So apart from the general well known instability problems that will
> > occur when you actually start *using* the system, there should be no
> 
> What known instabilities?

Where should I begin?  ;)

Most of the following have already been posted to LKML - primarily by
Anders ([EMAIL PROTECTED]) - it seems that no one cares, but I'll repost a
summary that Anders sent me below:

---
Scenario 1: Mailservers:
  2.6.10 (~24-40 hours uptime):
  Running ext3 on mailqueue:


Unable to handle kernel NULL pointer dereference at virtual address 0004
printing eip:
c018a095
*pde = 
Oops: 0002 [#1]
SMP
Modules linked in: nfs e1000 iptable_nat ipt_connlimit rtc
CPU:2
EIP:0060:[]Not tainted
EFLAGS: 00010286   (2.6.8.1)
EIP is at journal_commit_transaction+0x535/0x10e5
eax: cac1e26c   ebx:    ecx: f7cec400   edx: f7cec400
esi: f65f3000   edi: cac1e26c   ebp: f65f3000   esp: f65f3dc0
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 174, threadinfo=f65f3000 task=c2308b70)
Stack: f65f3e64      f7cec400 cda565fc
   149a 0004 f65f3e48 c01132d8 0002 c202ad20 0001 f65f3e5c
   c202ad20 c202ad20 0002 0001 001e 01c1af60 f65f3e68 c0407dc0
Call Trace:
 [] scheduler_tick+0x468/0x470
 [] find_busiest_group+0x105/0x310
 [] del_timer_sync+0x7e/0xa0
 [] kjournald+0xbd/0x230
 [] autoremove_wake_function+0x0/0x40
 [] autoremove_wake_function+0x0/0x40
 [] ret_from_fork+0x6/0x14
 [] commit_timeout+0x0/0x10
 [] kjournald+0x0/0x230
 [] kernel_thread_helper+0x5/0x18
Code: f0 ff 43 04 8b 03 83 e0 04 74 4c 8b 8c 24 b8 01 00 00 c6 81
 <2>SoftDog: Initiating system reboot


---
Scenario 2: Mailservers:
  Running XFS on mailqueue:


Filesystem "sdb1": xfs_trans_delete_ail: attempting to delete a log item that 
is not in the AIL
xfs_force_shutdown(sdb1,0x8) called from line 382 of file 
fs/xfs/xfs_trans_ail.c.  Return address = 0xc0216a56
@Linux version 2.6.9 ([EMAIL PROTECTED]) (gcc version 2.96 2731 (Red 
Hat Linux 7.3 2.96-113)) #1 SMP Tue Oct 19 16:04:55 CEST 2004



===
Resolution to the mailserver problem:
 2.4.28 is perfectly stable on these machines.

---
Scenario 3: Webservers:

  2.6.10, 2.6.10-ac8 (~3-12 hours uptime):


Unable to handle kernel paging request
<2>SoftDog: Initiating system reboot.

(No more...) :(

===
Resolution to the webserver problem:
 2.4.28/2.4.29-rc2 are stable here

---
Scenario 4: Storageservers: 
  2.6.8.1:
Oopses after ~5-10 hours with SMP on. Cannot find the actual Oopses
anymore, and 2.6.8+ haven't been tested as we cannot afford any more downtime on
these servers.


===
Resolution to the storage server problem:
 2.6.8.1 UP is stable (but oopses regularly after memory allocation
 failures)



Hardware on all servers: IBM x335 and x345.

Mentioned errors seen on a total of 17 servers.

-- 

 / jakob



Re: raid 1 - automatic 'repair' possible?

2005-01-19 Thread Jakob Oestergaard
On Wed, Jan 19, 2005 at 11:48:52AM +0100, Kiniger wrote:
...
> some random thoughts:
> 
> nowadays hardware sector sizes are much bigger than 512 bytes

No :)

> and
> the read error may affect some sectors +- the sector which actually
> returned the error.

That's right

> 
> to keep the handling in userspace as much as possible: 
> 
> the real problem is the long resync time. therefore it would
> be sufficient to have a concept of "defective areas" per partition
> and drive (a few of them, perhaps four or so , would be enough) 
> which will be excluded from reads/writes and some means to
> re-synchronize these "defective areas" from the good counterparts
> of the other disk. This would avoid having the whole partition being
> marked as defective.

I wonder if it's really worth it.

The original idea has some merit I think - but what you're suggesting
here is almost "bad block remapping" with transparent recovery and user
space policy agents etc. etc.

If a drive has problems reading the platter, it can usually be corrected
by overwriting the given sector (either the drive can actually overwrite
the sector in place, or it will re-allocate it with severe read
performance penalties following). But there's a reason why that sector
went bad, and you really want to get the disk replaced.

I think the current policy of marking the disk as failed when it has
failed is sensible.

Just my 0.02 Euro

-- 

 / jakob



Re: recent nfs change causes autofs regression

2007-08-31 Thread Jakob Oestergaard
On Thu, Aug 30, 2007 at 10:16:37PM -0700, Linus Torvalds wrote:
> 
...
> > Why aren't we doing that for any other filesystem than NFS?
> 
> How hard is it to acknowledge the following little word:
> 
>   "regression"
> 
> It's simple. You broke things. You may want to fix them, but you need to 
> fix them in a way that does not break user space.

Trond has a point Linus.

What he "broke" is, for example, a ro mount being mounted as rw.

That *could* be a very serious security (etc.etc.) problem which he just fixed.
Anything depending on read-only not being enforced will cease to work, of
course, and that is what a few people complain about(!).

If ext3 in some rare case (which would still mean it hit a few thousand users)
failed to remember that a file had been marked read-only and allowed writes to
it, wouldn't we want to fix that too?  It would cause regressions, but we'd fix
it, right?

mount passes back the error code on a failed mount. autofs passes that error
along too (when people configure syslog correctly). In short; when these
serious mistakes are made and caught, the admin sees an error in his logs.

This is not wrong. This is good.

-- 

 / jakob



Re: recent nfs change causes autofs regression

2007-08-31 Thread Jakob Oestergaard
On Fri, Aug 31, 2007 at 01:07:56AM -0700, Linus Torvalds wrote:
...
> When we add NEW BEHAVIOUR, we don't add it to old interfaces when that 
> breaks old user mode! We add a new flag saying "I want the new behaviour".
> 
> This is not rocket science, guys. This is very basic kernel behaviour. The 
> kernel exists only to serve user space, and that means that there is no 
> more important thing to do than to make sure you don't break existing 
> users, unless you have some *damns* strong reasons.

100% agreed.

> The fact that he may *also* have broken insane setups is totally 
> irrelevant. Don't go off on some tangent that has nothing to do with the 
> regression in question!

It does not have "nothing" to do with the regression.

Some setups which worked more by accident than by design earlier on were broken
by the fix. This could have been avoided, I agree, but the breakage was caused
by the fix (or the breakage is the fix, however you prefer to look at it).

> > If ext3 in some rare case (which would still mean it hit a few thousand 
> > users)
> > failed to remember that a file had been marked read-only and allowed writes 
> > to
> > it, wouldn't we want to fix that too?  It would cause regressions, but we'd 
> > fix
> > it, right?
> 
> Stop blathering. Of course we fix security holes. But we don't break 
> things that don't need breaking. This wasn't a security hole.

*part* of it wasn't a security hole.

The other half very much was.

...
> In other words, it should (as I already mentioned once) have used 
> "nosharecache" by default, which makes it all work.
> 
> Then, people who want to re-use the caches (which in turn may mean that 
> everything needs to have the same flags), THOSE PEOPLE, who want the NEW 
> SEMANTICS (errors and all) should then use a "sharecache" flag.
> 
> See? You don't have to screw people over.

Sure, given that Trond (or whomever) has the time it takes to go and implement
all of this, there's no need to screw anyone.

Assuming he's on a schedule and this will have to wait, I agree with him that
it makes the most sense to play it safe security/consistency-wise rather than
functionality-wise.

> > mount passes back the error code on a failed mount. autofs passes that error
> > along too (when people configure syslog correctly). In short; when these
> > serious mistakes are made and caught, the admin sees an error in his logs.
> 
> Bullshit. "Seeing the error in his logs" doesn't help anything.

It makes troubleshooting possible, which addresses *the* major complaint from
*one* of the *two* people who complained about this.


-- 

 / jakob



Re: recent nfs change causes autofs regression

2007-09-03 Thread Jakob Oestergaard
On Fri, Aug 31, 2007 at 09:43:29AM -0700, Linus Torvalds wrote:
...
> This is *not* a security hole. In order to make it a security hole, you 
> need to be root in the first place.

Non-root users can write to places where root might believe they cannot write
because he might be under the mistaken assumption that ro means ro.

I am under the impression that that could have implications in some setups.

...
> 
>  - it's a misfeature that people are used to, and has been around forever.

Sure, they're used to it, but I doubt they are aware of it.

...
> so I really don't see why people excuse the new behaviour.

We can certainly agree that a nicer fix would be nicer :)

-- 

 / jakob

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sun, Aug 05, 2007 at 09:28:05AM +0200, Ingo Molnar wrote:
> 
> * Alan Cox <[EMAIL PROTECTED]> wrote:
> 
> > >  Can you give examples of backup solutions that rely on atime being 
> > > updated? I can understand backup tools using mtime/ctime for 
> > > incremental backups (like tar + Amanda, etc), but I'm having trouble 
> > > figuring out why someone would want to use atime for that.
> > 
> > HSM is the usual one, and to a large extent probably why Unix 
> > originally had atime. Basically migrating less used files away so as 
> > to keep the system disks tidy.
> 
> atime is used as a _hint_, at most and HSM sure works just fine on an 
> atime-incapable filesystem too. So it's the same deal as "add user_xattr 
> mount option to the filesystem to make Beagle index faster". It's now: 
> "if you use HSM storage add the atime mount option to make it slightly 
> more intelligent. Expect huge IO slowdowns though."
> 
> The only remotely valid compatibility argument would be Mutt - but even 
> that handles it just fine. (we broke way more software via noexec)

I find it pretty normal to use tmpreaper to clear out unused files from
certain types of semi-temporary directory structures. Those files are
often only ever read. They'd start randomly disappearing while in use.

But then again, maybe I'm the only guy on the planet who uses tmpreaper.

-- 

 / jakob



Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sun, Aug 05, 2007 at 06:42:30AM -0400, Jeff Garzik wrote:
...
> If you can show massive amounts of users that will actually be 
> negatively impacted, please present hard evidence.
> 
> Otherwise all this is useless hot air.

Peace Jeff  :)

In another mail, I gave an example with tmpreaper clearing out unused
files; if some of those files are only read and never modified,
tmpreaper would start deleting files which were still frequently used.

That's a regression, the way I see it. As for 'massive amounts of
users', well, tmpreaper exists in most distros, so it's possible it has
other users than just me.

-- 

 / jakob



Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sat, Aug 04, 2007 at 02:08:40PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> >The "relatime" thing that David mentioned might well be very useful, but 
> >it's probably even less used than "noatime" is. And sadly, I don't really 
> >see that changing (unless we were to actually change the defaults inside 
> >the kernel).
> 
> 
> I actually vote for that.  IMO, distros should turn -on- atime updates 
> when they know its needed.

Oh dear.

Why not just make ext3 fsync() a no-op while you're at it?

Distros can turn it back on if it's needed...

Of course I'm not serious, but like atime, fsync() is something one
expects to work if it's there.  Disabling atime updates or making
fsync() a no-op will both result in silent failure which I am sure we
can agree is disastrous.

Why on earth would you cripple the kernel defaults for ext3 (which is a
fine FS for boot/root filesystems), when the *fundamental* problem you
really want to solve lies much deeper in the implementation of the
filesystem?  Noatime doesn't solve the problem, it just makes it "less
horrible".

If you really need different filesystem performance characteristics, you
can switch to another filesystem. There's plenty to choose from.

-- 

 / jakob



Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-05 Thread Jakob Oestergaard
On Sun, Aug 05, 2007 at 02:46:48PM +0200, Ingo Molnar wrote:
> 
> * Jakob Oestergaard <[EMAIL PROTECTED]> wrote:
> 
> > > If you can show massive amounts of users that will actually be 
> > > negatively impacted, please present hard evidence.
> > > 
> > > Otherwise all this is useless hot air.
> > 
> > Peace Jeff :)
> > 
> > In another mail, I gave an example with tmpreaper clearing out unused 
> > files; if some of those files are only read and never modified, 
> > tmpreaper would start deleting files which were still frequently used.
> > 
> > That's a regression, the way I see it. As for 'massive amounts of 
> > users', well, tmpreaper exists in most distros, so it's possible it 
> > has other users than just me.
> 
> you mean tmpwatch?

Same same.

> The trivial change below fixes this. And with that 
> we've come to the end of an extremely short list of atime dependencies.

Please read what I wrote, not what you think I wrote.

If I only *read* those files, the mtime will not be updated, only the
atime.

And the files *will* then magically begin to disappear although they are
frequently used.

That will happen with a standard piece of software in a standard
configuration, in a scenario that may or may not be common... I have no
idea how common such a setup is - but I know how much it would suck to
have files magically disappearing because of a kernel upgrade  :)
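
The scenario can be sketched in plain shell (directory and file names are made up here, and `touch -d` assumes GNU coreutils, with a `-t` fallback): a file whose mtime is old because it is only ever read gets selected by an mtime-based cleanup policy even while it is in active use.

```shell
# Sketch of the scenario above: a file that is only ever read keeps an old
# mtime, so an mtime-based reaper would treat it as stale even while it is
# in active use.  (Paths are made up for illustration.)
dir=$(mktemp -d)
f="$dir/cache-entry"
echo data > "$f"
# backdate the modification time; '-d' is GNU touch, '-t' is the fallback
touch -m -d '30 days ago' "$f" 2>/dev/null || touch -m -t 202001010000 "$f"
cat "$f" > /dev/null                  # the file is still being read
# an mtime-only policy (find -mtime +7, as a reaper might use) now selects
# the file for deletion despite the recent read:
stale=$(find "$dir" -type f -mtime +7)
echo "would delete: $stale"
rm -rf "$dir"
```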

-- 

 / jakob



Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser(2)

2007-11-10 Thread Jakob Oestergaard
...
> I've double-checked the code for any possible off-by-one/overflow 
> errors.
...

Two things caught my eye.

...
> + case bol:
> + case subject:
> + if (*label_len >= SMK_MAXLEN)
> + goto out;
> + subjectstr[(*label_len)++] = data[i];

Why is the '>' necessary?  Could it happen that you had incremented past the
point of equality?

If that could not happen, then in my opinion '>=' is very misleading when '=='
is really what is needed.

...
> + case object:
> + if (*prevstate == blank) {
> + subjectstr[*label_len] = '\0';
> + *label_len = 0;
> + }

I wonder why it is valid to uncritically use the already incremented label_len
here, without checking its value (like is done above).

It seems strangely asymmetrical. I'm not saying it's wrong, because there may
be a subtle reason as to why it's not, but if that's the case then I think that
subtle reason should be documented with a comment.

...
> + case access:
> + if (*prevstate == blank) {
> + objectstr[*label_len] = '\0';
> + *label_len = 0;
> + }

Same applies here.
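
The append pattern under review can be sketched in user space (the function names and the limit value here are made up, not the real Smack code): if every appending path checks the length first, the length can never exceed the limit, so '==' states the invariant exactly, while '>=' silently tolerates a missed check elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#define SMK_MAXLEN 23   /* assumed limit; the real value lives in the patch */

/* Bounded append of one character; rejects input past the limit.
 * With checks like this on every path, *label_len can never exceed
 * SMK_MAXLEN, so '==' would express the invariant precisely. */
static int append_char(char *label, size_t *label_len, char c)
{
	if (*label_len == SMK_MAXLEN)   /* would be an overflow: reject */
		return -1;
	label[(*label_len)++] = c;
	return 0;
}

/* Append a whole string through the bounded helper. */
int append_string(char *label, size_t *label_len, const char *src)
{
	while (*src)
		if (append_char(label, label_len, *src++) != 0)
			return -1;
	return 0;
}
```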


-- 

 / jakob



Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Jakob Oestergaard
On Wed, Jan 09, 2008 at 02:32:53PM +0300, Anton Salikhmetov wrote:
...
> 
> This bug causes backup systems to *miss* changed files.
> 

This problem is seen with both Amanda and TSM (Tivoli Storage Manager).

A site running Amanda with, say, a full backup weekly and incremental backups
daily will only get weekly backups of its mmap-modified databases.

However, large sites running TSM will be hit even harder, because TSM
always performs incremental backups from the client (managing which
versions to keep for how long on the server side). TSM will *never* again take
a backup of the mmap-modified database.

The really nasty part is: nothing is failing. Everything *appears* to work.
Your data is just not backed up, because it appears to be untouched.

So, if you run TSM (or pretty much any other backup solution actually) on
Linux, maybe you should run a
 find / -type f -print0 | xargs -0 touch
before starting your backup job. Sort of removes the point of using proper
backup software, but at least you'll get your data backed up.


-- 

 / jakob



Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Jakob Oestergaard
On Wed, Jan 09, 2008 at 02:32:53PM +0300, Anton Salikhmetov wrote:
> Since no reaction in LKML was received for this message it seemed
> logical to suggest closing the bug #2645 as "WONTFIX":
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15

Thank you!

A quick run-down for those who don't know what this is about:

Some applications use mmap() to modify files. Common examples are databases.

Linux does not update the mtime of files that are modified using mmap, even if
msync() is called.

This is very clearly against OpenGroup specifications.

This misfeature causes such files to be silently *excluded* from normal backup
runs.

Solaris implements this properly.

NT has the same bug as Linux, using their private bastardisation of the mmap
interface - but since they don't care about SuS and are broken in so many other
ways, that really doesn't matter.
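
For the record, the behaviour is easy to probe with a few lines of C (the function name and path here are made up for illustration; this is a sketch, not a definitive test): dirty a MAP_SHARED mapping, msync() it, and check whether st_mtime moved. On kernels exhibiting the bug, the mtime does not advance.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Probe for the behaviour described above.  Returns 1 if st_mtime advanced
 * after a write through the mapping plus msync(), 0 if it did not (the bug
 * being discussed), and -1 on error. */
int mtime_advances_after_msync(const char *path)
{
	struct stat before, after;
	int ret = -1;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, "x", 1) == 1 && fstat(fd, &before) == 0) {
		char *map = mmap(NULL, 1, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		if (map != MAP_FAILED) {
			sleep(1);       /* make a timestamp change observable */
			map[0] = 'y';   /* write through the mapping only */
			msync(map, 1, MS_SYNC);
			if (fstat(fd, &after) == 0)
				ret = after.st_mtime > before.st_mtime;
			munmap(map, 1);
		}
	}
	close(fd);
	unlink(path);
	return ret;
}
```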


So, dear kernel developers, can we please integrate this patch to make Linux
stop silently excluding people's databases from their backups?

-- 

 / jakob



Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Jakob Oestergaard
On Wed, Jan 09, 2008 at 05:06:33PM -0500, Rik van Riel wrote:
...
> > 
> > Lather, rinse, repeat

Just verified this at one customer site; they had a db that was last backed up
in 2003 :/

> 
> On the other hand, updating the mtime and ctime whenever a page is dirtied
> also does not work right.  Apparently that can break mutt.
> 

Thinking back on the atime discussion, I bet there would be some performance
problems in updating the ctime/mtime that often too :)

> Calling msync() every once in a while with Anton's patch does not look like a
> fool proof method to me either, because the VM can write all the dirty pages
> to disk by itself, leaving nothing for msync() to detect.  (I think...)

Reading the man page:
"The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED and
PROT_WRITE will be marked for update at some point in the interval between a
write reference to the mapped region and the next call to  msync() with
MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is no
such call, these fields may be marked for update at any time after a write
reference if the underlying file is modified as a result."

So, whenever someone writes in the region, this must cause us to update the
mtime/ctime no later than at the time of the next call to msync().

Could one do something like the lazy atime updates, coupled with a forced flush
at msync()?

> Can we get by with simply updating the ctime and mtime every time msync()
> is called, regardless of whether or not the mmaped pages were still dirty
> by the time we called msync() ?

The update must still happen, eventually, after a write to the mapped region
followed by an unmap/close even if no msync is ever called.

The msync only serves as a "no later than" deadline. The write to the region
triggers the need for the update.

At least this is how I read the standard - please feel free to correct me if I
am mistaken.

-- 

 / jakob



Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-10 Thread Jakob Oestergaard
On Thu, Jan 10, 2008 at 03:03:03AM +0300, Anton Salikhmetov wrote:
...
> > I guess a third possible time (if we want to minimize the number of
> > updates) would be when natural syncing of the file data to disk, by
> > other things in the VM, would be about to clear the I_DIRTY_PAGES
> > flag on the inode.  That way we do not need to remember any special
> > "we already flushed all dirty data, but we have not updated the mtime
> > and ctime yet" state.
> >
> > Does this sound reasonable?
> 
> No, it doesn't. The msync() system call called with the MS_ASYNC flag
> should (the POSIX standard requires that) update the st_ctime and
> st_mtime stamps in the same manner as for the MS_SYNC flag. However,
> the current implementation of msync() doesn't call the do_fsync()
> function for the MS_ASYNC case. The msync() function may be called
> with the MS_ASYNC flag before "natural syncing".

If the update was done as Rik suggested, with the addition that msync()
triggered an explicit sync of the inode data, then everything would be ok,
right?

-- 

 / jakob



Re: Why is the kfree() argument const?

2008-01-18 Thread Jakob Oestergaard
On Thu, Jan 17, 2008 at 01:25:39PM -0800, Linus Torvalds wrote:
...
> Why do you make that mistake, when it is PROVABLY NOT TRUE!
> 
> Try this trivial program:
> 
>   int main(int argc, char **argv)
>   {
>   int i;
>   const int *c;
>   
>   i = 5;
>   c = &i;
>   i = 10;
>   return *c;
>   }
> 
> and realize that according to the C rules, if it returns anything but 10, 
> the compiler is *buggy*.

That's not how this works (as we obviously agree).

Please consider a rewrite of your example, demonstrating the usefulness and
proper application of const pointers:

extern void foo(const int *);

int main(int argc, char **argv)
{
 int i;

 i = 5;
 foo(&i);
 return i;
}

Now, if the program returns anything other than 5, it means someone cast away
const, which is generally considered a bad idea in most other software
projects, for this very reason.

*That* is the purpose of const pointers.

Besides, for most debugging-enabled free() implementations, free() does indeed
touch the memory pointed to by its argument, which makes giving it a const
pointer completely bogus except for a single potential optimized special-case
where it might actually not touch the memory.
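
A minimal user-space sketch of that last point (the names and the poison pattern here are made up, not a real allocator API): a debugging-enabled free() writes to the memory before releasing it, so a const-qualified parameter, kfree()-style, forces the implementation to cast the const away internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POISON_BYTE 0x5a   /* made-up poison pattern */

/* Overwrite a block with the poison pattern; the const qualifier on the
 * parameter has to be cast away to do so. */
static const void *poison(const void *p, size_t n)
{
	memset((void *)p, POISON_BYTE, n);   /* const cast away here */
	return p;
}

/* A debugging free() that poisons memory before releasing it.  Taking the
 * pointer as const, kfree()-style, means casting const away twice. */
void poisoning_free(const void *p, size_t n)
{
	poison(p, n);
	free((void *)p);   /* and again, to hand it back to the allocator */
}
```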

-- 

 / jakob



Re: Why is the kfree() argument const?

2008-01-18 Thread Jakob Oestergaard
On Fri, Jan 18, 2008 at 02:31:16PM +0100, Björn Steinbrink wrote:
...
> 
> Do you see anything that casts the const away? No? Me neither. Still,
> the memory that p points to was changed, because there was another
> pointer and that was not const.

*another* being key here.
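
That aliasing point fits in a few lines of C (a small illustrative sketch, nothing more): the value seen through the const-qualified pointer changes without any cast anywhere, because a second, non-const pointer refers to the same object.

```c
#include <assert.h>

/* The const pointer only restricts writes through *this* name; the object
 * itself can still change through another, non-const alias. */
int value_seen_through_const_alias(void)
{
	int i = 5;
	int *writable = &i;
	const int *readonly = &i;   /* no writes through this pointer */

	*writable = 10;             /* legal; no const is cast away */
	return *readonly;           /* reads 10 */
}
```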

> 
> > *That* is the purpose of const pointers.
> 
> The only thing that const can tell you is that you should not modify the
> value _yourself_, using that pointer _directly_.

Which is pretty damn useful.

Think about it.  Don't you ever use const?  Is it only ever in the way?

...
{snip long explanation about how one can avoid the benefits of const, without
using casts}
...
> If you want to restrict the set of pointers that can be invalidated by
> an other pointer, you'll have to use something else because const does
> not talk about invalidating aliasing pointers.

Precisely, so why are we discussing this?

I claim that const is useful. You claim that it can't solve all the worlds
problems. I agree with that, but I maintain it is still useful.

But, in order for it to be useful, people must not circumvent it in the wrong
places (such as kfree).

-- 

 / jakob



Re: Why is the kfree() argument const?

2008-01-18 Thread Jakob Oestergaard
On Fri, Jan 18, 2008 at 12:47:01PM +0100, Giacomo A. Catenazzi wrote:
...
> "restrict" exists for this reason. const is only about lvalue.

You think that I try to put more meaning into const than I do - but I don't.

Please read what I wrote, not what you want to think I wrote.

I agree that if I said what you seem to imply I said, then I would have been
wrong. But I didn't so I'm not ;)


-- 

 / jakob
