subject:"\[RFC 00\/10\] implement alternative and much simpler id allocator"

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-23 Thread Matthew Wilcox

From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> Nice work! A few random comments/questions:
> 
> - It does add some complexity, but I think a few comments would make it
>   more digestable.

I'm open to adding some comments ... I need some time between writing the code 
and writing the comments to be sure what comments are useful.

> - Hm, maybe I'm confused, and I certainly don't understand how the whole
>   radix tree works. But do you use every leaf node as an exceptional
>   entry initially, to allocate the first 62 ids from that level? This
>   code

I do!  And that question tells me one useful comment to add!

>   if ((bit + RADIX_TREE_EXCEPTIONAL_SHIFT) <
>   BITS_PER_LONG) {
>   bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
>   radix_tree_iter_replace(root, , slot,
>   (void *)((1UL << bit) |
>   RADIX_TREE_EXCEPTIONAL_ENTRY));
>   *id = new;
>   return 0;
>   }
> 
>operates on bit, which is the offset from index*IDA_BITMAP_BITS, and
>it seems to create an exceptional entry somewhere down the tree
>(which may of course be the root).
> 
>But we don't seem to allocate another bit from that exceptional entry
>ever unless it happened to sit at index 0; the code
> 
>   unsigned long tmp = (unsigned long)bitmap;
>   if (start < BITS_PER_LONG) {
>   unsigned tbit = find_next_zero_bit(,
>   BITS_PER_LONG, start);
>   if (tbit < BITS_PER_LONG) {
>   tmp |= 1UL << tbit;
>   rcu_assign_pointer(*slot, (void *)tmp);
>   *id = new + tbit -
>   RADIX_TREE_EXCEPTIONAL_SHIFT;
>   return 0;
>   }
>   }
> 
>is only used for small values of start (though it does seem to
>account for a non-zero value of new == iter.index * IDA_BITMAP_BITS).

Ahh.  You're reading the code wrong, and I wrote the code wrongly too, so it's 
clearly overly complex.  I was testing with 'start' set to 0, allocating N 
entries, and then freeing them.  If I'd set start to 1024 and allocated two 
entries, I'd've seen the failure.

Please see the top commit here ("Improve IDA exceptional entry handling"): 
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-20


> - In the micro-optimization department, I think we should avoid
>   find_next_zero_bit on a single word, and instead use __ffs. Something
>   like (assuming we can use bit instead of start)
> 
>   if (bit + RADIX_TREE_EXCEPTIONAL_SHIFT < BITS_PER_LONG) {
> tmp = (~(unsigned long)bitmap) >> (bit + RADIX_TREE_EXCEPTIONAL_SHIFT);
> if (tmp) {
>   tbit = __ffs(tmp) + bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
>   tmp = (unsigned long)bitmap | (1UL << tbit);
>   rcu_assign_pointer(*slot, (void *)tmp);
>   *id = new + tbit - RADIX_TREE_EXCEPTIONAL_SHIFT;
>   return 0;
> }
>   }

I'm certainly open to microoptimisations, but I'll have to think about this one 
for a bit.

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-23 Thread Matthew Wilcox

From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> Nice work! A few random comments/questions:
> 
> - It does add some complexity, but I think a few comments would make it
>   more digestable.

I'm open to adding some comments ... I need some time between writing the code 
and writing the comments to be sure what comments are useful.

> - Hm, maybe I'm confused, and I certainly don't understand how the whole
>   radix tree works. But do you use every leaf node as an exceptional
>   entry initially, to allocate the first 62 ids from that level? This
>   code

I do!  And that question tells me one useful comment to add!

>   if ((bit + RADIX_TREE_EXCEPTIONAL_SHIFT) <
>   BITS_PER_LONG) {
>   bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
>   radix_tree_iter_replace(root, , slot,
>   (void *)((1UL << bit) |
>   RADIX_TREE_EXCEPTIONAL_ENTRY));
>   *id = new;
>   return 0;
>   }
> 
>operates on bit, which is the offset from index*IDA_BITMAP_BITS, and
>it seems to create an exceptional entry somewhere down the tree
>(which may of course be the root).
> 
>But we don't seem to allocate another bit from that exceptional entry
>ever unless it happened to sit at index 0; the code
> 
>   unsigned long tmp = (unsigned long)bitmap;
>   if (start < BITS_PER_LONG) {
>   unsigned tbit = find_next_zero_bit(,
>   BITS_PER_LONG, start);
>   if (tbit < BITS_PER_LONG) {
>   tmp |= 1UL << tbit;
>   rcu_assign_pointer(*slot, (void *)tmp);
>   *id = new + tbit -
>   RADIX_TREE_EXCEPTIONAL_SHIFT;
>   return 0;
>   }
>   }
> 
>is only used for small values of start (though it does seem to
>account for a non-zero value of new == iter.index * IDA_BITMAP_BITS).

Ahh.  You're reading the code wrong, and I wrote the code wrongly too, so it's 
clearly overly complex.  I was testing with 'start' set to 0, allocating N 
entries, and then freeing them.  If I'd set start to 1024 and allocated two 
entries, I'd've seen the failure.

Please see the top commit here ("Improve IDA exceptional entry handling"): 
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-20


> - In the micro-optimization department, I think we should avoid
>   find_next_zero_bit on a single word, and instead use __ffs. Something
>   like (assuming we can use bit instead of start)
> 
>   if (bit + RADIX_TREE_EXCEPTIONAL_SHIFT < BITS_PER_LONG) {
> tmp = (~(unsigned long)bitmap) >> (bit + RADIX_TREE_EXCEPTIONAL_SHIFT);
> if (tmp) {
>   tbit = __ffs(tmp) + bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
>   tmp = (unsigned long)bitmap | (1UL << tbit);
>   rcu_assign_pointer(*slot, (void *)tmp);
>   *id = new + tbit - RADIX_TREE_EXCEPTIONAL_SHIFT;
>   return 0;
> }
>   }

I'm certainly open to microoptimisations, but I'll have to think about this one 
for a bit.

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-22 Thread Rasmus Villemoes

On Sat, Dec 17 2016, Matthew Wilcox  wrote:

> From: Matthew Wilcox
>> From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
>> > This sounds good. I think there may still be a lot of users that never
>> > allocate more than a handful of IDAs, making a 128 byte allocation still
>> > somewhat excessive. One thing I considered was (exactly as it's done for
>> > file descriptor tables) to embed a single word in the struct ida and
>> > use that initially; I haven't looked closely at newIDA, so I don't know
>> > how easy that would be or if its worth the complexity.
>> 
>> Heh, I was thinking about that too.  The radix tree supports "exceptional
>> entries" which have the bottom bit set.  On a 64-bit machine, we could use 62
>> of the bits in the radix tree root to store the ID bitmap.  I'm a little 
>> wary of the
>> potential complexity, but we should try it out.
>
> Test patch here: 
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-16
> It passes the test suite ... which I actually had to adjust because it
> now succeeds in cases where it hadn't (allocating ID 0 without
> preallocating), and it will now fail in cases where it hadn't
> previously (assuming a single preallocation would be enough).  There
> shouldn't be any examples of that in the kernel proper; it was simply
> me being lazy when I wrote the test suite.

Nice work! A few random comments/questions:

- It does add some complexity, but I think a few comments would make it
  more digestable.

- Hm, maybe I'm confused, and I certainly don't understand how the whole
  radix tree works. But do you use every leaf node as an exceptional
  entry initially, to allocate the first 62 ids from that level? This
  code

if ((bit + RADIX_TREE_EXCEPTIONAL_SHIFT) <
BITS_PER_LONG) {
bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
radix_tree_iter_replace(root, , slot,
(void *)((1UL << bit) |
RADIX_TREE_EXCEPTIONAL_ENTRY));
*id = new;
return 0;
}

   operates on bit, which is the offset from index*IDA_BITMAP_BITS, and
   it seems to create an exceptional entry somewhere down the tree
   (which may of course be the root).

   But we don't seem to allocate another bit from that exceptional entry
   ever unless it happened to sit at index 0; the code

unsigned long tmp = (unsigned long)bitmap;
if (start < BITS_PER_LONG) {
unsigned tbit = find_next_zero_bit(,
BITS_PER_LONG, start);
if (tbit < BITS_PER_LONG) {
tmp |= 1UL << tbit;
rcu_assign_pointer(*slot, (void *)tmp);
*id = new + tbit -
RADIX_TREE_EXCEPTIONAL_SHIFT;
return 0;
}
}

   is only used for small values of start (though it does seem to
   account for a non-zero value of new == iter.index * IDA_BITMAP_BITS).
   
- In the micro-optimization department, I think we should avoid
  find_next_zero_bit on a single word, and instead use __ffs. Something
  like (assuming we can use bit instead of start)

  if (bit + RADIX_TREE_EXCEPTIONAL_SHIFT < BITS_PER_LONG) {
tmp = (~(unsigned long)bitmap) >> (bit + RADIX_TREE_EXCEPTIONAL_SHIFT);
if (tmp) {
  tbit = __ffs(tmp) + bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
  tmp = (unsigned long)bitmap | (1UL << tbit);
  rcu_assign_pointer(*slot, (void *)tmp);
  *id = new + tbit - RADIX_TREE_EXCEPTIONAL_SHIFT;
  return 0;
}
  }

Rasmus

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-22 Thread Rasmus Villemoes

On Sat, Dec 17 2016, Matthew Wilcox  wrote:

> From: Matthew Wilcox
>> From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
>> > This sounds good. I think there may still be a lot of users that never
>> > allocate more than a handful of IDAs, making a 128 byte allocation still
>> > somewhat excessive. One thing I considered was (exactly as it's done for
>> > file descriptor tables) to embed a single word in the struct ida and
>> > use that initially; I haven't looked closely at newIDA, so I don't know
>> > how easy that would be or if its worth the complexity.
>> 
>> Heh, I was thinking about that too.  The radix tree supports "exceptional
>> entries" which have the bottom bit set.  On a 64-bit machine, we could use 62
>> of the bits in the radix tree root to store the ID bitmap.  I'm a little 
>> wary of the
>> potential complexity, but we should try it out.
>
> Test patch here: 
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-16
> It passes the test suite ... which I actually had to adjust because it
> now succeeds in cases where it hadn't (allocating ID 0 without
> preallocating), and it will now fail in cases where it hadn't
> previously (assuming a single preallocation would be enough).  There
> shouldn't be any examples of that in the kernel proper; it was simply
> me being lazy when I wrote the test suite.

Nice work! A few random comments/questions:

- It does add some complexity, but I think a few comments would make it
  more digestable.

- Hm, maybe I'm confused, and I certainly don't understand how the whole
  radix tree works. But do you use every leaf node as an exceptional
  entry initially, to allocate the first 62 ids from that level? This
  code

if ((bit + RADIX_TREE_EXCEPTIONAL_SHIFT) <
BITS_PER_LONG) {
bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
radix_tree_iter_replace(root, , slot,
(void *)((1UL << bit) |
RADIX_TREE_EXCEPTIONAL_ENTRY));
*id = new;
return 0;
}

   operates on bit, which is the offset from index*IDA_BITMAP_BITS, and
   it seems to create an exceptional entry somewhere down the tree
   (which may of course be the root).

   But we don't seem to allocate another bit from that exceptional entry
   ever unless it happened to sit at index 0; the code

unsigned long tmp = (unsigned long)bitmap;
if (start < BITS_PER_LONG) {
unsigned tbit = find_next_zero_bit(,
BITS_PER_LONG, start);
if (tbit < BITS_PER_LONG) {
tmp |= 1UL << tbit;
rcu_assign_pointer(*slot, (void *)tmp);
*id = new + tbit -
RADIX_TREE_EXCEPTIONAL_SHIFT;
return 0;
}
}

   is only used for small values of start (though it does seem to
   account for a non-zero value of new == iter.index * IDA_BITMAP_BITS).
   
- In the micro-optimization department, I think we should avoid
  find_next_zero_bit on a single word, and instead use __ffs. Something
  like (assuming we can use bit instead of start)

  if (bit + RADIX_TREE_EXCEPTIONAL_SHIFT < BITS_PER_LONG) {
tmp = (~(unsigned long)bitmap) >> (bit + RADIX_TREE_EXCEPTIONAL_SHIFT);
if (tmp) {
  tbit = __ffs(tmp) + bit + RADIX_TREE_EXCEPTIONAL_SHIFT;
  tmp = (unsigned long)bitmap | (1UL << tbit);
  rcu_assign_pointer(*slot, (void *)tmp);
  *id = new + tbit - RADIX_TREE_EXCEPTIONAL_SHIFT;
  return 0;
}
  }

Rasmus

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-17 Thread Matthew Wilcox

From: Matthew Wilcox
> From: Matthew Wilcox
> > Heh, I was thinking about that too.  The radix tree supports "exceptional
> > entries" which have the bottom bit set.  On a 64-bit machine, we could use
> 62
> > of the bits in the radix tree root to store the ID bitmap.  I'm a little 
> > wary of
> the
> > potential complexity, but we should try it out.
> 
> Test patch here: http://git.infradead.org/users/willy/linux-
> dax.git/shortlog/refs/heads/idr-2016-12-16
> It passes the test suite ... which I actually had to adjust because it now 
> succeeds
> in cases where it hadn't (allocating ID 0 without preallocating), and it will 
> now
> fail in cases where it hadn't previously (assuming a single preallocation 
> would
> be enough).  There shouldn't be any examples of that in the kernel proper; it
> was simply me being lazy when I wrote the test suite.

Found a bug, committed a fix (and another test case).  It now no longer returns 
-EAGAIN in situations where it wouldn't have, so I've reverted that bit of the 
test suite change.  Still succeeds when it wouldn't have before, but I feel no 
pressure to change that ;-)

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-17 Thread Matthew Wilcox

From: Matthew Wilcox
> From: Matthew Wilcox
> > Heh, I was thinking about that too.  The radix tree supports "exceptional
> > entries" which have the bottom bit set.  On a 64-bit machine, we could use
> 62
> > of the bits in the radix tree root to store the ID bitmap.  I'm a little 
> > wary of
> the
> > potential complexity, but we should try it out.
> 
> Test patch here: http://git.infradead.org/users/willy/linux-
> dax.git/shortlog/refs/heads/idr-2016-12-16
> It passes the test suite ... which I actually had to adjust because it now 
> succeeds
> in cases where it hadn't (allocating ID 0 without preallocating), and it will 
> now
> fail in cases where it hadn't previously (assuming a single preallocation 
> would
> be enough).  There shouldn't be any examples of that in the kernel proper; it
> was simply me being lazy when I wrote the test suite.

Found a bug, committed a fix (and another test case).  It now no longer returns 
-EAGAIN in situations where it wouldn't have, so I've reverted that bit of the 
test suite change.  Still succeeds when it wouldn't have before, but I feel no 
pressure to change that ;-)

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-17 Thread Matthew Wilcox

From: Matthew Wilcox
> From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> > This sounds good. I think there may still be a lot of users that never
> > allocate more than a handful of IDAs, making a 128 byte allocation still
> > somewhat excessive. One thing I considered was (exactly as it's done for
> > file descriptor tables) to embed a single word in the struct ida and
> > use that initially; I haven't looked closely at newIDA, so I don't know
> > how easy that would be or if its worth the complexity.
> 
> Heh, I was thinking about that too.  The radix tree supports "exceptional
> entries" which have the bottom bit set.  On a 64-bit machine, we could use 62
> of the bits in the radix tree root to store the ID bitmap.  I'm a little wary 
> of the
> potential complexity, but we should try it out.

Test patch here: 
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-16
It passes the test suite ... which I actually had to adjust because it now 
succeeds in cases where it hadn't (allocating ID 0 without preallocating), and 
it will now fail in cases where it hadn't previously (assuming a single 
preallocation would be enough).  There shouldn't be any examples of that in the 
kernel proper; it was simply me being lazy when I wrote the test suite.

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-17 Thread Matthew Wilcox

From: Matthew Wilcox
> From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> > This sounds good. I think there may still be a lot of users that never
> > allocate more than a handful of IDAs, making a 128 byte allocation still
> > somewhat excessive. One thing I considered was (exactly as it's done for
> > file descriptor tables) to embed a single word in the struct ida and
> > use that initially; I haven't looked closely at newIDA, so I don't know
> > how easy that would be or if its worth the complexity.
> 
> Heh, I was thinking about that too.  The radix tree supports "exceptional
> entries" which have the bottom bit set.  On a 64-bit machine, we could use 62
> of the bits in the radix tree root to store the ID bitmap.  I'm a little wary 
> of the
> potential complexity, but we should try it out.

Test patch here: 
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-16
It passes the test suite ... which I actually had to adjust because it now 
succeeds in cases where it hadn't (allocating ID 0 without preallocating), and 
it will now fail in cases where it hadn't previously (assuming a single 
preallocation would be enough).  There shouldn't be any examples of that in the 
kernel proper; it was simply me being lazy when I wrote the test suite.

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-16 Thread Matthew Wilcox

From: Andrew Morton [mailto:a...@linux-foundation.org]
> On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes
>  wrote:
> > TL;DR: these patches save 250 KB of memory, with more low-hanging
> > fruit ready to pick.
> >
> > While browsing through the lib/idr.c code, I noticed that the code at
> > the end of ida_get_new_above() probably doesn't work as intended: Most
> > users of ida use it via ida_simple_get(), and that starts by
> > unconditionally calling ida_pre_get(), ensuring that ida->idr has
> > 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> > case, none (or at most one) of these get used during
> > ida_get_new_above(), and we only free one, leaving at least 6 (usually
> > 7) idr_layers in the free list.
> 
> I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
> the above patch (#33) into 4.11-rc1.

Hi Rasmus,

Thanks for your work on this; you've really put some effort into proving your 
work has value.  My motivation was purely aesthetic, but you've got some 
genuine savings here (admittedly it's about a quarter of a cent's worth of 
memory with DRAM selling for $10/GB).  Nevertheless, that adds up over a 
billion devices, and there are still people trying to fit Linux into 4MB 
embedded devices.

I think my reimplementation of the IDA on top of the radix tree is close enough 
to your tIDA in memory consumption that it doesn't warrant a new data structure.

On a 64-bit machine, your tIDA root is 24 bytes; my new IDA root is 16 bytes.  
If you allocate only one entry, you'll allocate 8 bytes.  Thanks to the slab 
allocator, that gets rounded up to 32 bytes.  I allocate the full 128 byte 
leaf, but I store the pointer to it in the root (unlike the IDR, the radix tree 
doesn't need to allocate a layer for a single entry).  So tIDA wins on memory 
consumption between 1 and 511 IDs, and newIDA is slightly ahead between 512 and 
1023 IDs.  Above 1024 IDs, I allocate a layer (576 bytes), and a second leaf 
(832 bytes total), while you just double to 256 bytes.  I think tIDA's memory 
consumption then stays ahead of new IDA.  But performance of 'allocate new ID' 
should be better for newIDA than tIDA as newIDA can skip over all the 
cachelines of full bitmaps.

Yesterday, I found a new problem with the IDA allocator that you hadn't 
mentioned -- about half of the users of the IDA data structure never call 
destroy_ida().  Which means that they're leaking the preloaded bitmap.  I have 
a patch which moves the preloaded IDA bitmap from being stored in the IDA to 
being stored in a percpu variable.  You can find it here: 
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-16
 I'd welcome more testing and code review.

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-16 Thread Matthew Wilcox

From: Andrew Morton [mailto:a...@linux-foundation.org]
> On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes
>  wrote:
> > TL;DR: these patches save 250 KB of memory, with more low-hanging
> > fruit ready to pick.
> >
> > While browsing through the lib/idr.c code, I noticed that the code at
> > the end of ida_get_new_above() probably doesn't work as intended: Most
> > users of ida use it via ida_simple_get(), and that starts by
> > unconditionally calling ida_pre_get(), ensuring that ida->idr has
> > 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> > case, none (or at most one) of these get used during
> > ida_get_new_above(), and we only free one, leaving at least 6 (usually
> > 7) idr_layers in the free list.
> 
> I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
> the above patch (#33) into 4.11-rc1.

Hi Rasmus,

Thanks for your work on this; you've really put some effort into proving your 
work has value.  My motivation was purely aesthetic, but you've got some 
genuine savings here (admittedly it's about a quarter of a cent's worth of 
memory with DRAM selling for $10/GB).  Nevertheless, that adds up over a 
billion devices, and there are still people trying to fit Linux into 4MB 
embedded devices.

I think my reimplementation of the IDA on top of the radix tree is close enough 
to your tIDA in memory consumption that it doesn't warrant a new data structure.

On a 64-bit machine, your tIDA root is 24 bytes; my new IDA root is 16 bytes.  
If you allocate only one entry, you'll allocate 8 bytes.  Thanks to the slab 
allocator, that gets rounded up to 32 bytes.  I allocate the full 128 byte 
leaf, but I store the pointer to it in the root (unlike the IDR, the radix tree 
doesn't need to allocate a layer for a single entry).  So tIDA wins on memory 
consumption between 1 and 511 IDs, and newIDA is slightly ahead between 512 and 
1023 IDs.  Above 1024 IDs, I allocate a layer (576 bytes), and a second leaf 
(832 bytes total), while you just double to 256 bytes.  I think tIDA's memory 
consumption then stays ahead of new IDA.  But performance of 'allocate new ID' 
should be better for newIDA than tIDA as newIDA can skip over all the 
cachelines of full bitmaps.

Yesterday, I found a new problem with the IDA allocator that you hadn't 
mentioned -- about half of the users of the IDA data structure never call 
destroy_ida().  Which means that they're leaking the preloaded bitmap.  I have 
a patch which moves the preloaded IDA bitmap from being stored in the IDA to 
being stored in a percpu variable.  You can find it here: 
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/idr-2016-12-16
 I'd welcome more testing and code review.

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-16 Thread Matthew Wilcox

From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> On Fri, Dec 16 2016, Matthew Wilcox  wrote:
> > Thanks for your work on this; you've really put some effort into
> > proving your work has value.  My motivation was purely aesthetic, but
> > you've got some genuine savings here (admittedly it's about a quarter
> > of a cent's worth of memory with DRAM selling for $10/GB).
> > Nevertheless, that adds up over a billion devices, and there are still
> > people trying to fit Linux into 4MB embedded devices.
> >
> 
> Yeah, my main motivation was embedded devices which don't have the
> luxury of measuring their RAM in GB. E.g., it's crazy that the
> watchdog_ida effectively use more memory than the .text of the watchdog
> subsystem, and similarly for the kthread workers, etc., etc.. I didn't
> mean for my patches to go in as is, more to provoke some discussion. I
> wasn't aware of your reimplementation, but it seems that may make the
> problem go away.

It certainly shrinks the problem down to a size where it may not be worth 
introducing another implementation.

> > On a 64-bit machine, your tIDA root is 24 bytes; my new IDA root is 16
> > bytes.  If you allocate only one entry, you'll allocate 8 bytes.
> > Thanks to the slab allocator, that gets rounded up to 32 bytes.  I
> > allocate the full 128 byte leaf, but I store the pointer to it in the
> > root (unlike the IDR, the radix tree doesn't need to allocate a layer
> > for a single entry).  So tIDA wins on memory consumption between 1 and
> > 511 IDs, and newIDA is slightly ahead between 512 and 1023 IDs.
> 
> This sounds good. I think there may still be a lot of users that never
> allocate more than a handful of IDAs, making a 128 byte allocation still
> somewhat excessive. One thing I considered was (exactly as it's done for
> file descriptor tables) to embed a single word in the struct ida and
> use that initially; I haven't looked closely at newIDA, so I don't know
> how easy that would be or if its worth the complexity.

Heh, I was thinking about that too.  The radix tree supports "exceptional 
entries" which have the bottom bit set.  On a 64-bit machine, we could use 62 
of the bits in the radix tree root to store the ID bitmap.  I'm a little wary 
of the potential complexity, but we should try it out.

Did you come up with any fun tests that could be added to the test-suite?  It 
feels a little slender right now.

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-16 Thread Matthew Wilcox

From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> On Fri, Dec 16 2016, Matthew Wilcox  wrote:
> > Thanks for your work on this; you've really put some effort into
> > proving your work has value.  My motivation was purely aesthetic, but
> > you've got some genuine savings here (admittedly it's about a quarter
> > of a cent's worth of memory with DRAM selling for $10/GB).
> > Nevertheless, that adds up over a billion devices, and there are still
> > people trying to fit Linux into 4MB embedded devices.
> >
> 
> Yeah, my main motivation was embedded devices which don't have the
> luxury of measuring their RAM in GB. E.g., it's crazy that the
> watchdog_ida effectively use more memory than the .text of the watchdog
> subsystem, and similarly for the kthread workers, etc., etc.. I didn't
> mean for my patches to go in as is, more to provoke some discussion. I
> wasn't aware of your reimplementation, but it seems that may make the
> problem go away.

It certainly shrinks the problem down to a size where it may not be worth 
introducing another implementation.

> > On a 64-bit machine, your tIDA root is 24 bytes; my new IDA root is 16
> > bytes.  If you allocate only one entry, you'll allocate 8 bytes.
> > Thanks to the slab allocator, that gets rounded up to 32 bytes.  I
> > allocate the full 128 byte leaf, but I store the pointer to it in the
> > root (unlike the IDR, the radix tree doesn't need to allocate a layer
> > for a single entry).  So tIDA wins on memory consumption between 1 and
> > 511 IDs, and newIDA is slightly ahead between 512 and 1023 IDs.
> 
> This sounds good. I think there may still be a lot of users that never
> allocate more than a handful of IDAs, making a 128 byte allocation still
> somewhat excessive. One thing I considered was (exactly as it's done for
> file descriptor tables) to embed a single word in the struct ida and
> use that initially; I haven't looked closely at newIDA, so I don't know
> how easy that would be or if its worth the complexity.

Heh, I was thinking about that too.  The radix tree supports "exceptional 
entries" which have the bottom bit set.  On a 64-bit machine, we could use 62 
of the bits in the radix tree root to store the ID bitmap.  I'm a little wary 
of the potential complexity, but we should try it out.

Did you come up with any fun tests that could be added to the test-suite?  It 
feels a little slender right now.

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-16 Thread Rasmus Villemoes

On Fri, Dec 16 2016, Matthew Wilcox  wrote:

> From: Andrew Morton [mailto:a...@linux-foundation.org]
>> On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes
>>  wrote:
>> > TL;DR: these patches save 250 KB of memory, with more low-hanging
>> > fruit ready to pick.
>> >
>> > While browsing through the lib/idr.c code, I noticed that the code at
>> > the end of ida_get_new_above() probably doesn't work as intended: Most
>> > users of ida use it via ida_simple_get(), and that starts by
>> > unconditionally calling ida_pre_get(), ensuring that ida->idr has
>> > 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
>> > case, none (or at most one) of these get used during
>> > ida_get_new_above(), and we only free one, leaving at least 6 (usually
>> > 7) idr_layers in the free list.
>> 
>> I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
>> the above patch (#33) into 4.11-rc1.
>
> Hi Rasmus,
>
> Thanks for your work on this; you've really put some effort into
> proving your work has value.  My motivation was purely aesthetic, but
> you've got some genuine savings here (admittedly it's about a quarter
> of a cent's worth of memory with DRAM selling for $10/GB).
> Nevertheless, that adds up over a billion devices, and there are still
> people trying to fit Linux into 4MB embedded devices.
>

Yeah, my main motivation was embedded devices which don't have the
luxury of measuring their RAM in GB. E.g., it's crazy that the
watchdog_ida effectively use more memory than the .text of the watchdog
subsystem, and similarly for the kthread workers, etc., etc.. I didn't
mean for my patches to go in as is, more to provoke some discussion. I
wasn't aware of your reimplementation, but it seems that may make the
problem go away.

> I think my reimplementation of the IDA on top of the radix tree is
> close enough to your tIDA in memory consumption that it doesn't
> warrant a new data structure.
>
> On a 64-bit machine, your tIDA root is 24 bytes; my new IDA root is 16
> bytes.  If you allocate only one entry, you'll allocate 8 bytes.
> Thanks to the slab allocator, that gets rounded up to 32 bytes.  I
> allocate the full 128 byte leaf, but I store the pointer to it in the
> root (unlike the IDR, the radix tree doesn't need to allocate a layer
> for a single entry).  So tIDA wins on memory consumption between 1 and
> 511 IDs, and newIDA is slightly ahead between 512 and 1023 IDs.

This sounds good. I think there may still be a lot of users that never
allocate more than a handful of IDAs, making a 128 byte allocation still
somewhat excessive. One thing I considered was (exactly as it's done for
file descriptor tables) to embed a single word in the struct ida and
use that initially; I haven't looked closely at newIDA, so I don't know
how easy that would be or if its worth the complexity.

Rasmus

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-16 Thread Rasmus Villemoes

On Fri, Dec 16 2016, Matthew Wilcox  wrote:

> From: Andrew Morton [mailto:a...@linux-foundation.org]
>> On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes
>>  wrote:
>> > TL;DR: these patches save 250 KB of memory, with more low-hanging
>> > fruit ready to pick.
>> >
>> > While browsing through the lib/idr.c code, I noticed that the code at
>> > the end of ida_get_new_above() probably doesn't work as intended: Most
>> > users of ida use it via ida_simple_get(), and that starts by
>> > unconditionally calling ida_pre_get(), ensuring that ida->idr has
>> > 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
>> > case, none (or at most one) of these get used during
>> > ida_get_new_above(), and we only free one, leaving at least 6 (usually
>> > 7) idr_layers in the free list.
>> 
>> I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
>> the above patch (#33) into 4.11-rc1.
>
> Hi Rasmus,
>
> Thanks for your work on this; you've really put some effort into
> proving your work has value.  My motivation was purely aesthetic, but
> you've got some genuine savings here (admittedly it's about a quarter
> of a cent's worth of memory with DRAM selling for $10/GB).
> Nevertheless, that adds up over a billion devices, and there are still
> people trying to fit Linux into 4MB embedded devices.
>

Yeah, my main motivation was embedded devices which don't have the
luxury of measuring their RAM in GB. E.g., it's crazy that the
watchdog_ida effectively use more memory than the .text of the watchdog
subsystem, and similarly for the kthread workers, etc., etc.. I didn't
mean for my patches to go in as is, more to provoke some discussion. I
wasn't aware of your reimplementation, but it seems that may make the
problem go away.

> I think my reimplementation of the IDA on top of the radix tree is
> close enough to your tIDA in memory consumption that it doesn't
> warrant a new data structure.
>
> On a 64-bit machine, your tIDA root is 24 bytes; my new IDA root is 16
> bytes.  If you allocate only one entry, you'll allocate 8 bytes.
> Thanks to the slab allocator, that gets rounded up to 32 bytes.  I
> allocate the full 128 byte leaf, but I store the pointer to it in the
> root (unlike the IDR, the radix tree doesn't need to allocate a layer
> for a single entry).  So tIDA wins on memory consumption between 1 and
> 511 IDs, and newIDA is slightly ahead between 512 and 1023 IDs.

This sounds good. I think there may still be a lot of users that never
allocate more than a handful of IDAs, making a 128 byte allocation still
somewhat excessive. One thing I considered was (exactly as it's done for
file descriptor tables) to embed a single word in the struct ida and
use that initially; I haven't looked closely at newIDA, so I don't know
how easy that would be or if its worth the complexity.

Rasmus

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-12 Thread Tejun Heo

Hello, Matthew.

On Mon, Dec 12, 2016 at 05:35:17PM +, Matthew Wilcox wrote:
> I know the preload followed by preload_end looks wrong.  I don't
> think it's broken though.  If we get preempted, then the worst
> situation is that we'll end up with the memory we preallocated being
> allocated to somebody else.  Then the attempt to allocate memory can
> fail, and we'll return -EAGAIN, at which point all callers are
> supposed to return to the pre_get() stage.  Certainly that's what
> ida_simple_get() does.

Ah, right, ida_pre_get() doesn't have any protection against other
task allocating inbetween pre_get and the actual allocation, so it
should retry on failure.  Yeah, then the proposed preloading wouldn't
be wrong.  It'd be nice to explain what's going on tho.

> I'd definitely be open to changing the IDA API.  I know Kent had
> some thoughts on that including splitting the simple lock into a
> per-IDA lock.  His last work on it was here, I believe:
> 
> https://evilpiepirate.org/git/linux-bcache.git/log/?h=idr

Yeah, that was a big re-write, but for now I think it'd be nice to
replace ida's pre_get mechanism with something similar to idr's
preload so that they're more consistent.  There aren't that many
direct users of ida_pre_get(), so it shouldn't be too difficult to
change.

Thanks.

-- 
tejun

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-12 Thread Tejun Heo

Hello, Matthew.

On Mon, Dec 12, 2016 at 05:35:17PM +, Matthew Wilcox wrote:
> I know the preload followed by preload_end looks wrong.  I don't
> think it's broken though.  If we get preempted, then the worst
> situation is that we'll end up with the memory we preallocated being
> allocated to somebody else.  Then the attempt to allocate memory can
> fail, and we'll return -EAGAIN, at which point all callers are
> supposed to return to the pre_get() stage.  Certainly that's what
> ida_simple_get() does.

Ah, right, ida_pre_get() doesn't have any protection against other
task allocating inbetween pre_get and the actual allocation, so it
should retry on failure.  Yeah, then the proposed preloading wouldn't
be wrong.  It'd be nice to explain what's going on tho.

> I'd definitely be open to changing the IDA API.  I know Kent had
> some thoughts on that including splitting the simple lock into a
> per-IDA lock.  His last work on it was here, I believe:
> 
> https://evilpiepirate.org/git/linux-bcache.git/log/?h=idr

Yeah, that was a big re-write, but for now I think it'd be nice to
replace ida's pre_get mechanism with something similar to idr's
preload so that they're more consistent.  There aren't that many
direct users of ida_pre_get(), so it shouldn't be too difficult to
change.

Thanks.

-- 
tejun

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-12 Thread Matthew Wilcox

From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of Tejun Heo
> Ah, yeah, great to see the silly implementation being replaced the
> radix tree.  ida_pre_get() looks suspicious tho.  idr_preload()
> immedicately being followed by idr_preload_end() probably is broken.
> Maybe what we need is moving ida to idr like preload interface and
> then convert it to radix based interface?  ida currently assumes
> per-ida preloading.

Hey Tejun!  Great to have your comments on this reimplementation.

I know the preload followed by preload_end looks wrong.  I don't think it's 
broken though.  If we get preempted, then the worst situation is that we'll end 
up with the memory we preallocated being allocated to somebody else.  Then the 
attempt to allocate memory can fail, and we'll return -EAGAIN, at which point 
all callers are supposed to return to the pre_get() stage.  Certainly that's 
what ida_simple_get() does.

Hmm ... looking at the implementation again, we might return -ENOMEM when we 
should return -EAGAIN.  Let me fix that (and the test suite ...)

I'd definitely be open to changing the IDA API.  I know Kent had some thoughts 
on that including splitting the simple lock into a per-IDA lock.  His last work 
on it was here, I believe:

https://evilpiepirate.org/git/linux-bcache.git/log/?h=idr

RE: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-12 Thread Matthew Wilcox

From: Tejun Heo [mailto:hte...@gmail.com] On Behalf Of Tejun Heo
> Ah, yeah, great to see the silly implementation being replaced the
> radix tree.  ida_pre_get() looks suspicious tho.  idr_preload()
> immedicately being followed by idr_preload_end() probably is broken.
> Maybe what we need is moving ida to idr like preload interface and
> then convert it to radix based interface?  ida currently assumes
> per-ida preloading.

Hey Tejun!  Great to have your comments on this reimplementation.

I know the preload followed by preload_end looks wrong.  I don't think it's 
broken though.  If we get preempted, then the worst situation is that we'll end 
up with the memory we preallocated being allocated to somebody else.  Then the 
attempt to allocate memory can fail, and we'll return -EAGAIN, at which point 
all callers are supposed to return to the pre_get() stage.  Certainly that's 
what ida_simple_get() does.

Hmm ... looking at the implementation again, we might return -ENOMEM when we 
should return -EAGAIN.  Let me fix that (and the test suite ...)

I'd definitely be open to changing the IDA API.  I know Kent had some thoughts 
on that including splitting the simple lock into a per-IDA lock.  His last work 
on it was here, I believe:

https://evilpiepirate.org/git/linux-bcache.git/log/?h=idr

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-12 Thread Tejun Heo

Hello,

On Fri, Dec 09, 2016 at 02:01:40PM -0800, Andrew Morton wrote:
> On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes 
>  wrote:
> 
> > TL;DR: these patches save 250 KB of memory, with more low-hanging
> > fruit ready to pick.
> > 
> > While browsing through the lib/idr.c code, I noticed that the code at
> > the end of ida_get_new_above() probably doesn't work as intended: Most
> > users of ida use it via ida_simple_get(), and that starts by
> > unconditionally calling ida_pre_get(), ensuring that ida->idr has
> > 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> > case, none (or at most one) of these get used during
> > ida_get_new_above(), and we only free one, leaving at least 6 (usually
> > 7) idr_layers in the free list.
> 
> Please be aware of
> 
> http://ozlabs.org/~akpm/mmots/broken-out/reimplement-idr-and-ida-using-the-radix-tree.patch
> http://lkml.kernel.org/r/1480369871-5271-68-git-send-email-mawil...@linuxonhyperv.com
> 
> I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
> the above patch (#33) into 4.11-rc1.

Ah, yeah, great to see the silly implementation being replaced the
radix tree.  ida_pre_get() looks suspicious tho.  idr_preload()
immedicately being followed by idr_preload_end() probably is broken.
Maybe what we need is moving ida to idr like preload interface and
then convert it to radix based interface?  ida currently assumes
per-ida preloading.

Thanks.

-- 
tejun

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-12 Thread Tejun Heo

Hello,

On Fri, Dec 09, 2016 at 02:01:40PM -0800, Andrew Morton wrote:
> On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes 
>  wrote:
> 
> > TL;DR: these patches save 250 KB of memory, with more low-hanging
> > fruit ready to pick.
> > 
> > While browsing through the lib/idr.c code, I noticed that the code at
> > the end of ida_get_new_above() probably doesn't work as intended: Most
> > users of ida use it via ida_simple_get(), and that starts by
> > unconditionally calling ida_pre_get(), ensuring that ida->idr has
> > 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> > case, none (or at most one) of these get used during
> > ida_get_new_above(), and we only free one, leaving at least 6 (usually
> > 7) idr_layers in the free list.
> 
> Please be aware of
> 
> http://ozlabs.org/~akpm/mmots/broken-out/reimplement-idr-and-ida-using-the-radix-tree.patch
> http://lkml.kernel.org/r/1480369871-5271-68-git-send-email-mawil...@linuxonhyperv.com
> 
> I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
> the above patch (#33) into 4.11-rc1.

Ah, yeah, great to see the silly implementation being replaced the
radix tree.  ida_pre_get() looks suspicious tho.  idr_preload()
immedicately being followed by idr_preload_end() probably is broken.
Maybe what we need is moving ida to idr like preload interface and
then convert it to radix based interface?  ida currently assumes
per-ida preloading.

Thanks.

-- 
tejun

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-09 Thread Andrew Morton

On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes  
wrote:

> TL;DR: these patches save 250 KB of memory, with more low-hanging
> fruit ready to pick.
> 
> While browsing through the lib/idr.c code, I noticed that the code at
> the end of ida_get_new_above() probably doesn't work as intended: Most
> users of ida use it via ida_simple_get(), and that starts by
> unconditionally calling ida_pre_get(), ensuring that ida->idr has
> 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> case, none (or at most one) of these get used during
> ida_get_new_above(), and we only free one, leaving at least 6 (usually
> 7) idr_layers in the free list.

Please be aware of

http://ozlabs.org/~akpm/mmots/broken-out/reimplement-idr-and-ida-using-the-radix-tree.patch
http://lkml.kernel.org/r/1480369871-5271-68-git-send-email-mawil...@linuxonhyperv.com

I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
the above patch (#33) into 4.11-rc1.

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-09 Thread Andrew Morton

On Thu,  8 Dec 2016 02:22:55 +0100 Rasmus Villemoes  
wrote:

> TL;DR: these patches save 250 KB of memory, with more low-hanging
> fruit ready to pick.
> 
> While browsing through the lib/idr.c code, I noticed that the code at
> the end of ida_get_new_above() probably doesn't work as intended: Most
> users of ida use it via ida_simple_get(), and that starts by
> unconditionally calling ida_pre_get(), ensuring that ida->idr has
> 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> case, none (or at most one) of these get used during
> ida_get_new_above(), and we only free one, leaving at least 6 (usually
> 7) idr_layers in the free list.

Please be aware of

http://ozlabs.org/~akpm/mmots/broken-out/reimplement-idr-and-ida-using-the-radix-tree.patch
http://lkml.kernel.org/r/1480369871-5271-68-git-send-email-mawil...@linuxonhyperv.com

I expect we'll be merging patches 1-32 of that series into 4.10-rc1 and
the above patch (#33) into 4.11-rc1.

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-09 Thread Tejun Heo

Hello, Rasmus.

On Thu, Dec 08, 2016 at 02:22:55AM +0100, Rasmus Villemoes wrote:
> TL;DR: these patches save 250 KB of memory, with more low-hanging
> fruit ready to pick.
> 
> While browsing through the lib/idr.c code, I noticed that the code at
> the end of ida_get_new_above() probably doesn't work as intended: Most
> users of ida use it via ida_simple_get(), and that starts by
> unconditionally calling ida_pre_get(), ensuring that ida->idr has
> 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> case, none (or at most one) of these get used during
> ida_get_new_above(), and we only free one, leaving at least 6 (usually
> 7) idr_layers in the free list.

So, if you take a look at idr_layer_alloc(), there are two alternative
preloading mechanisms.  The one based on per-idr freelist -
get_from_free_list() - and the one using per-cpu preload cache.  idr
currently uses the new percpu path and ida uses the old path.  There
isn't anything fundamental to this difference.  It's just that we
introduced the new perpcu path to idr and haven't converted ida over
to it yet.

> Patches 1 and 2 are minor optimization opportunities, while patch 3 is
> an attempt at making the 'free the extra idr_layers one at a time'
> actually work. But it's not a very good solution, since it doesn't
> help the users who never do more than a handful of allocations, nor
> does it help those who call ida_pre_get/ida_get_new
> directly. Moreover, even if we somehow had perfect heuristics and got
> rid of all excess idr_layers and ida_bitmap (another 128 bytes we
> usually have hanging around), the minimum overhead of sizeof(struct
> idr_layer)+sizeof(struct ida_bitmap) ~ 2200 bytes is quite a lot for
> the many ida users who never allocate more than 100 ids.
> 
> So instead/in addition, I suggest we introduce a much simpler
> allocator based on a dynamically growing bitmap. Patches 4-10
> introduce struct tida and has a few examples of replacing ida with
> tida. [Yeah, tida is not a good name, and probably _get and _put are also
> bad.]

So, instead of creating something new, it'd be a lot better to
implement the same per-cpu preload behavior for ida so that caching is
per-cpu instead of per-ida.

Thanks.

-- 
tejun

Re: [RFC 00/10] implement alternative and much simpler id allocator

2016-12-09 Thread Tejun Heo

Hello, Rasmus.

On Thu, Dec 08, 2016 at 02:22:55AM +0100, Rasmus Villemoes wrote:
> TL;DR: these patches save 250 KB of memory, with more low-hanging
> fruit ready to pick.
> 
> While browsing through the lib/idr.c code, I noticed that the code at
> the end of ida_get_new_above() probably doesn't work as intended: Most
> users of ida use it via ida_simple_get(), and that starts by
> unconditionally calling ida_pre_get(), ensuring that ida->idr has
> 8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
> case, none (or at most one) of these get used during
> ida_get_new_above(), and we only free one, leaving at least 6 (usually
> 7) idr_layers in the free list.

So, if you take a look at idr_layer_alloc(), there are two alternative
preloading mechanisms.  The one based on per-idr freelist -
get_from_free_list() - and the one using per-cpu preload cache.  idr
currently uses the new percpu path and ida uses the old path.  There
isn't anything fundamental to this difference.  It's just that we
introduced the new perpcu path to idr and haven't converted ida over
to it yet.

> Patches 1 and 2 are minor optimization opportunities, while patch 3 is
> an attempt at making the 'free the extra idr_layers one at a time'
> actually work. But it's not a very good solution, since it doesn't
> help the users who never do more than a handful of allocations, nor
> does it help those who call ida_pre_get/ida_get_new
> directly. Moreover, even if we somehow had perfect heuristics and got
> rid of all excess idr_layers and ida_bitmap (another 128 bytes we
> usually have hanging around), the minimum overhead of sizeof(struct
> idr_layer)+sizeof(struct ida_bitmap) ~ 2200 bytes is quite a lot for
> the many ida users who never allocate more than 100 ids.
> 
> So instead/in addition, I suggest we introduce a much simpler
> allocator based on a dynamically growing bitmap. Patches 4-10
> introduce struct tida and has a few examples of replacing ida with
> tida. [Yeah, tida is not a good name, and probably _get and _put are also
> bad.]

So, instead of creating something new, it'd be a lot better to
implement the same per-cpu preload behavior for ida so that caching is
per-cpu instead of per-ida.

Thanks.

-- 
tejun

[RFC 00/10] implement alternative and much simpler id allocator

2016-12-07 Thread Rasmus Villemoes

TL;DR: these patches save 250 KB of memory, with more low-hanging
fruit ready to pick.

While browsing through the lib/idr.c code, I noticed that the code at
the end of ida_get_new_above() probably doesn't work as intended: Most
users of ida use it via ida_simple_get(), and that starts by
unconditionally calling ida_pre_get(), ensuring that ida->idr has
8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
case, none (or at most one) of these get used during
ida_get_new_above(), and we only free one, leaving at least 6 (usually
7) idr_layers in the free list.

The comment suggests that the idea is that subsequent allocations
would get rid of these one by one, but this doesn't work due to the
unconditional filling done by ida_simple_get - and callers who do
their own locking presumably call ida_pre_get themselves, with the
same effect.

To confirm my suspicion, and to see how many idas are actually in use
and how much they're utilized, I hacked together a /proc/ida file to
provide some stats on the currently active idas. Some examples of
lines from that on my laptop:

file:line   getsputsdifflayers  id_free_cnt
block/blk-core.c:49 27  0   27  1   7
drivers/base/platform.c:35  1   0   1   1   6
drivers/gpu/drm/drm_crtc.c:201  0   0   0   0   0
drivers/gpu/drm/drm_crtc.c:201  0   0   0   0   0
...
drivers/gpu/drm/drm_crtc.c:201  1   0   1   1   6
drivers/gpu/drm/drm_crtc.c:201  2   0   2   1   7
drivers/gpu/drm/drm_crtc.c:5700 5   0   5   1   7
...
drivers/scsi/hosts.c:45 6   0   6   1   7
drivers/scsi/sd.c:123   1   0   1   1   6
fs/devpts/inode.c:386   57  40  17  1   7
fs/kernfs/dir.c:918 10  0   10  1   7
fs/kernfs/dir.c:918 11  0   11  1   7
...
fs/kernfs/dir.c:918 24799   160923190   1   7
...
fs/super.c:882  49  5   44  1   7
kernel/workqueue.c:3198 10  8   2   1   7
kernel/workqueue.c:3198 11  9   2   1   7
kernel/workqueue.c:3198 2   0   2   1   7
kernel/workqueue.c:3198 2   0   2   1   7
kernel/workqueue.c:3198 2   0   2   1   7
kernel/workqueue.c:3198 94  91  3   1   7

file:line is the location of the ida_init or DEFINE_IDA, gets and puts
are the lifetime number of allocations/frees - their difference is
thus the current number of allocated ids. layers and id_free_cnt are
the members of ida->idr.

As expected, there are 0, 6 or 7 idr_layers cached and unused in each
ida, depending on whether there have been 0, 1 or more allocations
done. Add the idr_layer which is in use, and we see that e.g. the
workqueue instances use more than 6*8*sizeof(struct idr_layer) ~ 100K
of memory, or almost 1K per allocated id.

Patches 1 and 2 are minor optimization opportunities, while patch 3 is
an attempt at making the 'free the extra idr_layers one at a time'
actually work. But it's not a very good solution, since it doesn't
help the users who never do more than a handful of allocations, nor
does it help those who call ida_pre_get/ida_get_new
directly. Moreover, even if we somehow had perfect heuristics and got
rid of all excess idr_layers and ida_bitmap (another 128 bytes we
usually have hanging around), the minimum overhead of sizeof(struct
idr_layer)+sizeof(struct ida_bitmap) ~ 2200 bytes is quite a lot for
the many ida users who never allocate more than 100 ids.

So instead/in addition, I suggest we introduce a much simpler
allocator based on a dynamically growing bitmap. Patches 4-10
introduce struct tida and has a few examples of replacing ida with
tida. [Yeah, tida is not a good name, and probably _get and _put are also
bad.]

I've booted a 4.8.12 kernel with these backported, and as expected
they save around 250KB of memory. There's more to gain, but this is
just an RFC.



Rasmus Villemoes (10):
  lib/idr.c: reused free bitmaps are already clear
  lib/idr.c: delete useless condition
  lib/idr.c: only fill ida->idr when needed
  lib/tida.c: a very simple integer id allocator
  kernel/workqueue.c: replace id allocator ida with tida
  block: use tida as small id allocator
  drivers/base/platform.c: use simpler id allocator
  lib/tida.c: introduce tida_get_above
  drm: use simpler id allocator
  fs/devpts: use tida for id allocation

 block/blk-core.c|   6 +--
 block/blk-sysfs.c   |   2 +-
 block/blk.h |   4 +-
 drivers/base/platform.c |  10 ++--
 drivers/gpu/drm/drm_connector.c |  21 
 drivers/gpu/drm/drm_crtc.c  |   4 +-
 fs/devpts/inode.c   |  28 ---
 include/drm/drm_crtc.h  |   3 +-
 include/linux/tida.h|  32 
 kernel/workqueue.c  |  14 +++---
 lib/Makefile|   2 +-
 lib/idr.c   |

[RFC 00/10] implement alternative and much simpler id allocator

2016-12-07 Thread Rasmus Villemoes

TL;DR: these patches save 250 KB of memory, with more low-hanging
fruit ready to pick.

While browsing through the lib/idr.c code, I noticed that the code at
the end of ida_get_new_above() probably doesn't work as intended: Most
users of ida use it via ida_simple_get(), and that starts by
unconditionally calling ida_pre_get(), ensuring that ida->idr has
8==MAX_IDR_FREE idr_layers in its free list id_free. In the common
case, none (or at most one) of these get used during
ida_get_new_above(), and we only free one, leaving at least 6 (usually
7) idr_layers in the free list.

The comment suggests that the idea is that subsequent allocations
would get rid of these one by one, but this doesn't work due to the
unconditional filling done by ida_simple_get - and callers who do
their own locking presumably call ida_pre_get themselves, with the
same effect.

To confirm my suspicion, and to see how many idas are actually in use
and how much they're utilized, I hacked together a /proc/ida file to
provide some stats on the currently active idas. Some examples of
lines from that on my laptop:

file:line   getsputsdifflayers  id_free_cnt
block/blk-core.c:49 27  0   27  1   7
drivers/base/platform.c:35  1   0   1   1   6
drivers/gpu/drm/drm_crtc.c:201  0   0   0   0   0
drivers/gpu/drm/drm_crtc.c:201  0   0   0   0   0
...
drivers/gpu/drm/drm_crtc.c:201  1   0   1   1   6
drivers/gpu/drm/drm_crtc.c:201  2   0   2   1   7
drivers/gpu/drm/drm_crtc.c:5700 5   0   5   1   7
...
drivers/scsi/hosts.c:45 6   0   6   1   7
drivers/scsi/sd.c:123   1   0   1   1   6
fs/devpts/inode.c:386   57  40  17  1   7
fs/kernfs/dir.c:918 10  0   10  1   7
fs/kernfs/dir.c:918 11  0   11  1   7
...
fs/kernfs/dir.c:918 24799   160923190   1   7
...
fs/super.c:882  49  5   44  1   7
kernel/workqueue.c:3198 10  8   2   1   7
kernel/workqueue.c:3198 11  9   2   1   7
kernel/workqueue.c:3198 2   0   2   1   7
kernel/workqueue.c:3198 2   0   2   1   7
kernel/workqueue.c:3198 2   0   2   1   7
kernel/workqueue.c:3198 94  91  3   1   7

file:line is the location of the ida_init or DEFINE_IDA, gets and puts
are the lifetime number of allocations/frees - their difference is
thus the current number of allocated ids. layers and id_free_cnt are
the members of ida->idr.

As expected, there are 0, 6 or 7 idr_layers cached and unused in each
ida, depending on whether there have been 0, 1 or more allocations
done. Add the idr_layer which is in use, and we see that e.g. the
workqueue instances use more than 6*8*sizeof(struct idr_layer) ~ 100K
of memory, or almost 1K per allocated id.

Patches 1 and 2 are minor optimization opportunities, while patch 3 is
an attempt at making the 'free the extra idr_layers one at a time'
actually work. But it's not a very good solution, since it doesn't
help the users who never do more than a handful of allocations, nor
does it help those who call ida_pre_get/ida_get_new
directly. Moreover, even if we somehow had perfect heuristics and got
rid of all excess idr_layers and ida_bitmap (another 128 bytes we
usually have hanging around), the minimum overhead of sizeof(struct
idr_layer)+sizeof(struct ida_bitmap) ~ 2200 bytes is quite a lot for
the many ida users who never allocate more than 100 ids.

So instead/in addition, I suggest we introduce a much simpler
allocator based on a dynamically growing bitmap. Patches 4-10
introduce struct tida and has a few examples of replacing ida with
tida. [Yeah, tida is not a good name, and probably _get and _put are also
bad.]

I've booted a 4.8.12 kernel with these backported, and as expected
they save around 250KB of memory. There's more to gain, but this is
just an RFC.



Rasmus Villemoes (10):
  lib/idr.c: reused free bitmaps are already clear
  lib/idr.c: delete useless condition
  lib/idr.c: only fill ida->idr when needed
  lib/tida.c: a very simple integer id allocator
  kernel/workqueue.c: replace id allocator ida with tida
  block: use tida as small id allocator
  drivers/base/platform.c: use simpler id allocator
  lib/tida.c: introduce tida_get_above
  drm: use simpler id allocator
  fs/devpts: use tida for id allocation

 block/blk-core.c|   6 +--
 block/blk-sysfs.c   |   2 +-
 block/blk.h |   4 +-
 drivers/base/platform.c |  10 ++--
 drivers/gpu/drm/drm_connector.c |  21 
 drivers/gpu/drm/drm_crtc.c  |   4 +-
 fs/devpts/inode.c   |  28 ---
 include/drm/drm_crtc.h  |   3 +-
 include/linux/tida.h|  32 
 kernel/workqueue.c  |  14 +++---
 lib/Makefile|   2 +-
 lib/idr.c   |

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

RE: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

Re: [RFC 00/10] implement alternative and much simpler id allocator

[RFC 00/10] implement alternative and much simpler id allocator

[RFC 00/10] implement alternative and much simpler id allocator

26 matches

Site Navigation

Mail list logo

Footer information