Re: SLUB performance regression vs SLAB

2007-10-05 Thread Jens Axboe
On Fri, Oct 05 2007, David Chinner wrote:
 On Thu, Oct 04, 2007 at 03:07:18PM -0700, David Miller wrote:
  From: Chuck Ebbert [EMAIL PROTECTED] Date: Thu, 04 Oct 2007 17:47:48
  -0400
  
   On 10/04/2007 05:11 PM, David Miller wrote:
From: Chuck Ebbert [EMAIL PROTECTED] Date: Thu, 04 Oct 2007 17:02:17
-0400

How do you simulate reading 100TB of data spread across 3000 disks,
selecting 10% of it using some criterion, then sorting and summarizing
the result?

You repeatedly read zeros from a smaller disk into the same amount of
memory, and sort that as if it were real data instead.
   
   You've just replaced 3000 concurrent streams of data with a single stream.
   That won't test the memory allocator's ability to allocate memory to many
   concurrent users very well.
  
  You've kindly removed my thinking outside of the box comment.
  
  The point is was not that my specific suggestion would be perfect, but that
  if you used your creativity and thought in similar directions you might find
  a way to do it.
  
  People are too narrow minded when it comes to these things, and that's the
  problem I want to address.
 
 And it's a good point, too, because often problems to one person are a
 no-brainer to someone else.
 
 Creating lots of fake disks is trivial to do, IMO.  Use loopback on
 sparse files containing sparse filesxi, use ramdisks containing sparse
 files or write a sparse dm target for sparse block device mapping,
 etc. I'm sure there's more than the few I just threw out...

Or use scsi_debug to fake drives/controllers, works wonderful as well
for some things and involve the full IO stack.

I'd like to second Davids emails here, this is a serious problem. Having
a reproducible test case lowers the barrier for getting the problem
fixed by orders of magnitude. It's the difference between the problem
getting fixed in a day or two and it potentially lingering for months,
because email ping-pong takes forever and the test team has moved on to
other tests, we'll let you know the results of test foo in 3 weeks time
when we have a new slot on the box just removing any developer
motivation to work on the issue.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Pekka Enberg
Hi,

On 10/5/07, Jens Axboe [EMAIL PROTECTED] wrote:
 I'd like to second Davids emails here, this is a serious problem. Having
 a reproducible test case lowers the barrier for getting the problem
 fixed by orders of magnitude. It's the difference between the problem
 getting fixed in a day or two and it potentially lingering for months,
 because email ping-pong takes forever and the test team has moved on to
 other tests, we'll let you know the results of test foo in 3 weeks time
 when we have a new slot on the box just removing any developer
 motivation to work on the issue.

What I don't understand is that why don't the people who _have_ access
to the test case fix the problem? Unlike slab, slub is not a pile of
crap that only Christoph can hack on...

   Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Jens Axboe
On Fri, Oct 05 2007, Pekka Enberg wrote:
 Hi,
 
 On 10/5/07, Jens Axboe [EMAIL PROTECTED] wrote:
  I'd like to second Davids emails here, this is a serious problem. Having
  a reproducible test case lowers the barrier for getting the problem
  fixed by orders of magnitude. It's the difference between the problem
  getting fixed in a day or two and it potentially lingering for months,
  because email ping-pong takes forever and the test team has moved on to
  other tests, we'll let you know the results of test foo in 3 weeks time
  when we have a new slot on the box just removing any developer
  motivation to work on the issue.
 
 What I don't understand is that why don't the people who _have_ access
 to the test case fix the problem? Unlike slab, slub is not a pile of
 crap that only Christoph can hack on...

Often the people testing are only doing just that, testing. So they
kindly offer to test any patches and so on, which usually takes forever
because of the above limitations in response time, machine availability,
etc.

Writing a small test module to exercise slub/slab in various ways
(allocating from all cpus freeing from one, as described) should not be
too hard. Perhaps that would be enough to find this performance
discrepancy between slab and slub?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/32] IGET: Stop EXT2 from using iget() and read_inode() [try #2]

2007-10-05 Thread Theodore Tso
On Thu, Oct 04, 2007 at 04:57:08PM +0100, David Howells wrote:
 Stop the EXT2 filesystem from using iget() and read_inode().  Replace
 ext2_read_inode() with ext2_iget(), and call that instead of iget().
 ext2_iget() then uses iget_locked() directly and returns a proper error code
 instead of an inode in the event of an error.
 
 ext2_fill_super() returns any error incurred when getting the root inode
 instead of EINVAL.
 
 Signed-off-by: David Howells [EMAIL PROTECTED]

Acked-by: Theodore Ts'o [EMAIL PROTECTED]

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] isofs: add +w bit for non-RR discs

2007-10-05 Thread Jan Engelhardt

On Oct 4 2007 20:46, Matthew Wilcox wrote:
On Tue, Oct 02, 2007 at 08:00:26PM +0200, Jan Engelhardt wrote:
 Add %S_IWUGO bit for files on ISO-9660 filesystems without RockRidge

Looks to me like you've added S_IWUSR, not S_IWUGO.

Yes, S_IWUSR it should be, and is. When a user copies such a file,
the new file will be owned by him/her, so the +w bit needs to
be S_IWUSR indeed.

 -popt-mode = S_IRUGO | S_IXUGO; /*
 +popt-mode = S_IRUGO | S_IWUSR | S_IXUGO;
 -inode-i_mode = S_IRUGO | S_IXUGO | S_IFDIR;
 +inode-i_mode = S_IRUGO | S_IWUSR | S_IXUGO | S_IFDIR;

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Matthew Wilcox
On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
 I'd like to second Davids emails here, this is a serious problem. Having
 a reproducible test case lowers the barrier for getting the problem
 fixed by orders of magnitude. It's the difference between the problem
 getting fixed in a day or two and it potentially lingering for months,
 because email ping-pong takes forever and the test team has moved on to
 other tests, we'll let you know the results of test foo in 3 weeks time
 when we have a new slot on the box just removing any developer
 motivation to work on the issue.

I vaguely remembered something called orasim, so I went looking for it.
I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
seems to be a stillborn project.  Is there anything else I should know
about orasim?  ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Jens Axboe
On Fri, Oct 05 2007, Matthew Wilcox wrote:
 On Fri, Oct 05, 2007 at 08:48:53AM +0200, Jens Axboe wrote:
  I'd like to second Davids emails here, this is a serious problem. Having
  a reproducible test case lowers the barrier for getting the problem
  fixed by orders of magnitude. It's the difference between the problem
  getting fixed in a day or two and it potentially lingering for months,
  because email ping-pong takes forever and the test team has moved on to
  other tests, we'll let you know the results of test foo in 3 weeks time
  when we have a new slot on the box just removing any developer
  motivation to work on the issue.
 
 I vaguely remembered something called orasim, so I went looking for it.
 I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
 seems to be a stillborn project.  Is there anything else I should know
 about orasim?  ;-)

I don't know much about orasim, except that internally we're trying to
use fio for that instead. As far as I know, it was a project that was
never feature complete (or completed all together, for that matter).

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Andi Kleen
Jens Axboe [EMAIL PROTECTED] writes:
 
 Writing a small test module to exercise slub/slab in various ways
 (allocating from all cpus freeing from one, as described) should not be
 too hard. Perhaps that would be enough to find this performance
 discrepancy between slab and slub?

You could simulate that by just sending packets using unix sockets 
between threads bound to different CPUs. Sending a packet allocates; receiving 
deallocates.

But it's not clear that will really simulate the cache bounce environment
of the database test. I don't think all passing of data between CPUs 
using slub objects is slow.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/32] Remove iget() and read_inode() [try #2]

2007-10-05 Thread David Howells
David Howells [EMAIL PROTECTED] wrote:

 A tarball of the patches can be retrieved from:
 
   http://people.redhat.com/~dhowells/iget-remove.tar.bz2

I've updated that to actually reflect this set of patches and not the old set.

David
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/32] IGET: Introduce a function to register iget failure [try #2]

2007-10-05 Thread Christoph Hellwig
On Thu, Oct 04, 2007 at 04:56:17PM +0100, David Howells wrote:
 Introduce a function to register failure in an inode construction path.  This
 includes marking the inode under construction as bad, unlocking it and
 releasing it.

I'm a bit unconfortable with the name, but except for that this function
and the users added in the next patches look fine.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/32] IGET: Mark iget() and read_inode() as being obsolete [try #2]

2007-10-05 Thread Christoph Hellwig
Why do you move it out of line?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/32] IGET: Mark iget() and read_inode() as being obsolete [try #2]

2007-10-05 Thread David Howells
David Howells [EMAIL PROTECTED] wrote:

  Why do you move it out of line?
 
 Because otherwise every file that gets compiled that includes linux/fs.h will
 emit a warning that the implementation of that function is deprecated.

Or, rather, that read_inode() is deprecated.

David
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/32] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #2]

2007-10-05 Thread Christoph Hellwig
On Thu, Oct 04, 2007 at 04:56:07PM +0100, David Howells wrote:
 Add an ERR_CAST() macro to complement ERR_PTR and co. for the purposes of
 casting an error entyped as one pointer type to an error of another pointer
 type whilst making it explicit as to what is going on.
 
 This provides a replacement for the ERR_PTR(PTR_ERR(p)) construct.

I don't like this one very much, I'd rather take the variant Linus
outline in his last post on that topic..

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/32] IGET: Mark iget() and read_inode() as being obsolete [try #2]

2007-10-05 Thread David Howells
Christoph Hellwig [EMAIL PROTECTED] wrote:

 Why do you move it out of line?

Because otherwise every file that gets compiled that includes linux/fs.h will
emit a warning that the implementation of that function is deprecated.

David
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: Case-insensitive support for XFS

2007-10-05 Thread Christoph Hellwig
[Adding -fsdevel because some of the things touched here might be of
 broader interest and Urban because his name is on nls_utf8.c]

On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
 
 On it's own, linux only provides case conversion for old-style
 character sets - 8 bit sequences only. A lot of distos are
 now defaulting to UTF-8 and Linux NLS stuff does not support
 case conversion for any unicode sets.

The lack of case tables in nls_utf8.c defintively seems odd to me.
Urban, is there a reason for that?  The only thing that comes to
mind is that these tables might be quite large.

 NTFS in Linux also implements it's own dcache and NTFS also

^^^ dentry operations?

 stores its unicode case table on disk. This allows the filesystem
 to migrate to newer forms of Unicode at the time of formatting
 the filesystem. Eg. Windows Vista now supports Unicode 5.0
 while older version would support an earlier version of
 Unicode. Linux's version of NTFS case table is implemented
 in fs/ntfs/upcase.c defined as default_upcase.

Because ntfs uses 16bit wide chars it prefers to use it's own tables.
I'm not sure it's a that good idea.  JFS also has wide-char names on
disk but at least partially uses the generic nls support, so there must
be some trade-offs.

 It will be proposed that in the future, XFS may default to
 UTF-8 on disk and to go for the old format, explicitily
 use a mkfs.xfs option. Two superbits will be used: one for
 case-insensitive (which generates lowercase hashes on disk)
 and that already exists on IRIX filesystems and a new one
 for UTF-8 filenames. Any combination of the two bits can be
 used and the dentry_operations will be adjusted accordingly.

I don't think arbitrary combinations make sense.  Without case insensitive
support a unix filesystem couldn't care less what charset the filenames
are in, except for the terminating 0 and '/', '.', '..' it's an entirely
opaqueue stream of bytes.  So chosing a charset only makes sense
with the case insensitive filename option.

 So, in regards to the UTF-8 case-conversion/folding table, we
 have several options to choose from:
- Use the HFS+ method as-is.
- Use an NTFS scheme with an on-disk table.
- Pick a current table and stick with it (similar to HFS+).
- How much of Unicode to we support? Just the the Basic
  Multilingual Plane (U+ - U+) or the entire set?
  (anything above U+ won't have case-conversion
   requirements). Seems that all the other filesystems
   just support the BMP.
- UTF-8, UTF-16 or UCS-2.
 
 With the last point, UTF-8 has several advantages IMO:
- xfs_repair can easily detect UTF-8 sequences in filenames
  and also validate UTF-8 sequences.
- char based structures don't change
- nulls in filenames.
- no endian conversions required.

I think the right approach is to use the fs/nls/ code and allow the
user to select any table with a mount option as at least in russia
and eastern europe some non-utf8 charsets still seem to be prefered.
The default should of course be utf8 and support for utf8 case
conversion should be added to fs/nls/

 Internally, the names will probably be converted to u16s for
 efficient processing. Conversion between UTF-8 and UTF-16/UCS-2
 is very straight forward.

Do we really need that?  And if so please make sure this only happens
for filesystems created with the case insensitivity option so normal
filesystems don't have to pay for these bloated strings.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/32] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #2]

2007-10-05 Thread David Howells
Christoph Hellwig [EMAIL PROTECTED] wrote:

 I don't like this one very much, I'd rather take the variant Linus
 outline in his last post on that topic..

I'd rather not take that variant.

David
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/32] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #2]

2007-10-05 Thread Sam Ravnborg
On Fri, Oct 05, 2007 at 04:47:36PM +0100, David Howells wrote:
 Christoph Hellwig [EMAIL PROTECTED] wrote:
 
  I don't like this one very much, I'd rather take the variant Linus
  outline in his last post on that topic..
 
 I'd rather not take that variant.

One like one variant, another like another variant.
Both fails to say why.

So you left everyone wondering.

Sam
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/32] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #2]

2007-10-05 Thread David Howells
Sam Ravnborg [EMAIL PROTECTED] wrote:

 One like one variant, another like another variant.
 Both fails to say why.

It's shorter, more compact, and of course requires the least amount of change
(especially as I can automate it with a command line perl script).

David
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/32] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #2]

2007-10-05 Thread Randy Dunlap
On Thu, 04 Oct 2007 16:56:07 +0100 David Howells wrote:

 Add an ERR_CAST() macro to complement ERR_PTR and co. for the purposes of
 casting an error entyped as one pointer type to an error of another pointer
 type whilst making it explicit as to what is going on.
 
 This provides a replacement for the ERR_PTR(PTR_ERR(p)) construct.
 
 Signed-off-by: David Howells [EMAIL PROTECTED]
 ---
 
  include/linux/err.h |   12 
  1 files changed, 12 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/err.h b/include/linux/err.h
 index 1ab1d44..e810ca4 100644
 --- a/include/linux/err.h
 +++ b/include/linux/err.h
 @@ -34,6 +34,18 @@ static inline long IS_ERR(const void *ptr)
   return IS_ERR_VALUE((unsigned long)ptr);
  }
  
 +/**
 + * ERR_CAST - Explicitly cast an error-valued pointer to another pointer type
 + * ptr: The pointer to cast.

  @ptr:

 + *
 + * Explicitly cast an error-valued pointer to another pointer type in such a
 + * way as to make it clear that's what's going on.
 + */
 +static inline void *ERR_CAST(const void *ptr)
 +{
 + return (void *) ptr;
 +}
 +
  #endif
  
  #endif /* _LINUX_ERR_H */


---
~Randy
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: Case-insensitive support for XFS

2007-10-05 Thread Nicholas Miell
On Fri, 2007-10-05 at 16:44 +0100, Christoph Hellwig wrote:
 [Adding -fsdevel because some of the things touched here might be of
  broader interest and Urban because his name is on nls_utf8.c]
 
 On Fri, Oct 05, 2007 at 11:57:54AM +1000, Barry Naujok wrote:
  
  On it's own, linux only provides case conversion for old-style
  character sets - 8 bit sequences only. A lot of distos are
  now defaulting to UTF-8 and Linux NLS stuff does not support
  case conversion for any unicode sets.
 
 The lack of case tables in nls_utf8.c defintively seems odd to me.
 Urban, is there a reason for that?  The only thing that comes to
 mind is that these tables might be quite large.
 

Case conversion in Unicode is locale dependent. The legacy 8-bit
character encodings don't code for enough characters to run into the
ambiguities, so they can get away with fixed case conversion tables.
Unicode can't.

I'd point you to the Unicode technical report which explains how to do
it, but unicode.org seems to be offline right now.

  NTFS in Linux also implements it's own dcache and NTFS also
 
   ^^^ dentry operations?
 
  stores its unicode case table on disk. This allows the filesystem
  to migrate to newer forms of Unicode at the time of formatting
  the filesystem. Eg. Windows Vista now supports Unicode 5.0
  while older version would support an earlier version of
  Unicode. Linux's version of NTFS case table is implemented
  in fs/ntfs/upcase.c defined as default_upcase.
 
 Because ntfs uses 16bit wide chars it prefers to use it's own tables.
 I'm not sure it's a that good idea.  

Well, Windows uses those on-disk tables, so the Linux driver has to
also. I don't see how that's a bad idea or any way to not do it and
remain compatible.

-- 
Nicholas Miell [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Christoph Lameter
On Fri, 5 Oct 2007, Matthew Wilcox wrote:

 I vaguely remembered something called orasim, so I went looking for it.
 I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
 2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
 seems to be a stillborn project.  Is there anything else I should know
 about orasim?  ;-)

Too bad. If this would work then I would have a load to work against. I 
have a patch here that may address the issue for SMP (no NUMA for now) by 
batching all frees on the per cpu freelist and then dumping them in 
groups. But it is likely not too wise to have you run your weeklong 
tests on this one. Needs some more care first.



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Christoph Lameter
On Fri, 5 Oct 2007, Jens Axboe wrote:

 It might not, it might. The point is trying to isolate the problem and
 making a simple test case that could be used to reproduce it, so that
 Christoph (or someone else) can easily fix it.

In case there is someone who wants to hack on it: Here is what I got so 
far for batching the frees. I will try to come up with a test next week if 
nothing else happens before:

Patch 1/2 on top of mm:

SLUB: Keep counter of remaining objects on the per cpu list

Add a counter to keep track of how many objects are on the per cpu list.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/slub_def.h |1 +
 mm/slub.c|8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h  2007-10-04 
22:41:58.0 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h   2007-10-04 
22:42:08.0 -0700
@@ -15,6 +15,7 @@ struct kmem_cache_cpu {
void **freelist;
struct page *page;
int node;
+   int remaining;
unsigned int offset;
unsigned int objsize;
 };
Index: linux-2.6.23-rc8-mm2/mm/slub.c
===
--- linux-2.6.23-rc8-mm2.orig/mm/slub.c 2007-10-04 22:41:58.0 -0700
+++ linux-2.6.23-rc8-mm2/mm/slub.c  2007-10-04 22:42:08.0 -0700
@@ -1386,12 +1386,13 @@ static void deactivate_slab(struct kmem_
 * because both freelists are empty. So this is unlikely
 * to occur.
 */
-   while (unlikely(c-freelist)) {
+   while (unlikely(c-remaining)) {
void **object;
 
/* Retrieve object from cpu_freelist */
object = c-freelist;
c-freelist = c-freelist[c-offset];
+   c-remaining--;
 
/* And put onto the regular freelist */
object[c-offset] = page-freelist;
@@ -1491,6 +1492,7 @@ load_freelist:
 
object = c-page-freelist;
c-freelist = object[c-offset];
+   c-remaining = s-objects - c-page-inuse - 1;
c-page-inuse = s-objects;
c-page-freelist = NULL;
c-node = page_to_nid(c-page);
@@ -1574,13 +1576,14 @@ static void __always_inline *slab_alloc(
 
local_irq_save(flags);
c = get_cpu_slab(s, smp_processor_id());
-   if (unlikely(!c-freelist || !node_match(c, node)))
+   if (unlikely(!c-remaining || !node_match(c, node)))
 
object = __slab_alloc(s, gfpflags, node, addr, c);
 
else {
object = c-freelist;
c-freelist = object[c-offset];
+   c-remaining--;
}
local_irq_restore(flags);
 
@@ -1686,6 +1689,7 @@ static void __always_inline slab_free(st
if (likely(page == c-page  c-node = 0)) {
object[c-offset] = c-freelist;
c-freelist = object;
+   c-remaining++;
} else
__slab_free(s, page, x, addr, c-offset);
 


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread Christoph Lameter
Patch 2/2


SLUB: Allow foreign objects on the per cpu object lists.

In order to free objects we need to touch the page struct of the page that the
object belongs to. If this occurs too frequently then we could generate a 
bouncing
cacheline.

We do not want that to occur too frequently. We can avoid the page struct 
touching
for per cpu objects. Now we extend that to allow a limited number of objects 
that are
not part of the cpu slab. Allow up to 4 times the objects that fit into a page
in the per cpu list.

If the objects are allocated before we need to free them then we have saved 
touching
a page struct twice. The objects are presumably cache hot, so it is performance 
wise
good to recycle these locally.

Foreign objects are drained before deactivating cpu slabs and if too many 
objects
accumulate.

For kmem_cache_free() this also has the beneficial effect of getting 
virt_to_page()
operations eliminated or grouped together which may help reduce the cache 
footprint
and increase the speed of virt_to_page() lookups (they hopefully all come from 
the
same pages).

For kfree() we may have to do virt_to_page() in the worst case twice. Once 
grouped
together.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/slub_def.h |1 
 mm/slub.c|   82 ++-
 2 files changed, 68 insertions(+), 15 deletions(-)

Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h  2007-10-04 
22:42:08.0 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h   2007-10-04 
22:43:19.0 -0700
@@ -16,6 +16,7 @@ struct kmem_cache_cpu {
struct page *page;
int node;
int remaining;
+   int drain_limit;
unsigned int offset;
unsigned int objsize;
 };
Index: linux-2.6.23-rc8-mm2/mm/slub.c
===
--- linux-2.6.23-rc8-mm2.orig/mm/slub.c 2007-10-04 22:42:08.0 -0700
+++ linux-2.6.23-rc8-mm2/mm/slub.c  2007-10-04 22:56:49.0 -0700
@@ -187,6 +187,12 @@ static inline void ClearSlabDebug(struct
  */
 #define MAX_PARTIAL 10
 
+/*
+ * How many times the number of objects per slab can accumulate on the
+ * per cpu objects list before we drain it.
+ */
+#define DRAIN_FACTOR 4
+
 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)
 
@@ -1375,6 +1381,54 @@ static void unfreeze_slab(struct kmem_ca
}
 }
 
+static void __slab_free(struct kmem_cache *s, struct page *page,
+   void *x, void *addr, unsigned int offset);
+
+/*
+ * Drain freelist of objects foreign to the slab. Interrupts must be off.
+ *
+ * This is called
+ *
+ * 1. Before taking the slub lock when a cpu slab is to be deactivated.
+ *Deactivation can only deal with native objects on the freelist.
+ *
+ * 2. If the number of objects in the per cpu structures grows beyond
+ *3 times the objects that fit in a slab. In that case we need to throw
+ *some objects away. Stripping the foreign objects does the job and
+ *localizes any new the allocations.
+ */
+static void drain_foreign(struct kmem_cache *s, struct kmem_cache_cpu *c, void 
*addr)
+{
+   void **freelist = c-freelist;
+
+   if (unlikely(c-node  0)) {
+   /* Slow path user */
+   __slab_free(s, virt_to_head_page(freelist), freelist, addr, 
c-offset);
+   freelist = NULL;
+   c-remaining--;
+   }
+
+   if (!freelist)
+   return;
+
+   c-freelist = NULL;
+   c-remaining = 0;
+
+   while (freelist) {
+   void **object = freelist;
+   struct page *page = virt_to_head_page(freelist);
+
+   freelist = freelist[c-offset];
+   if (page == c-page) {
+   /* Local object. Keep for future allocations */
+   object[c-offset] = c-freelist;
+   c-freelist = object;
+   c-remaining++;
+   } else
+   __slab_free(s, page, object, NULL, c-offset);
+   }
+}
+
 /*
  * Remove the cpu slab
  */
@@ -1405,6 +1459,7 @@ static void deactivate_slab(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
+   drain_foreign(s, c, NULL);
slab_lock(c-page);
deactivate_slab(s, c);
 }
@@ -1480,6 +1535,7 @@ static void *__slab_alloc(struct kmem_ca
if (!c-page)
goto new_slab;
 
+   drain_foreign(s, c, NULL);
slab_lock(c-page);
if (unlikely(!node_match(c, node)))
goto another_slab;
@@ -1553,6 +1609,7 @@ debug:
c-page-inuse++;
c-page-freelist = object[c-offset];
c-node = -1;
+   c-remaining = s-objects * 64;

Re: SLUB performance regression vs SLAB

2007-10-05 Thread Peter Zijlstra

On Thu, 2007-10-04 at 17:02 -0400, Chuck Ebbert wrote:
 On 10/04/2007 04:55 PM, David Miller wrote:
  
  Anything, I do mean anything, can be simulated using small test
  programs.
 
 How do you simulate reading 100TB of data spread across 3000 disks,
 selecting 10% of it using some criterion, then sorting and summarizing
 the result?

Focus on the slab allocator usage, instrument it, record a trace,
generate a statistical model that matches, and write a small
programm/kernel module that has the same allocation pattern. Then verify
this statistical workload still shows the same performance difference.

Easy: no
Doable: yes



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SLUB performance regression vs SLAB

2007-10-05 Thread David Miller
From: Peter Zijlstra [EMAIL PROTECTED]
Date: Fri, 05 Oct 2007 22:32:00 +0200

 Focus on the slab allocator usage, instrument it, record a trace,
 generate a statistical model that matches, and write a small
 programm/kernel module that has the same allocation pattern. Then verify
 this statistical workload still shows the same performance difference.
 
 Easy: no
 Doable: yes

The other important bit is likely to generate a lot of DMA traffic
such that the L2 cache bandwidth is getting used on the bus
side by the PCI controller doing invalidations of both dirty
and clean L2 cache lines as devices DMA to/from them.

This will also be exercising the memory controller, further contending
with the cpu when SLAB touches cold data structures.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html