Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread George Spelvin
Indeed, thanks to everyone who commented.  The extra conceptual
complexity and reduced readbility is Just Not Worth It.

v2 (and final, as far as I'm concerned) follows.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 10:23:53PM +0300, Andrey Abramov wrote:
> 15.03.2019, 22:06, "Andy Shevchenko" :
> > It makes harder to read and understand the code. One needs spend 
> > deciseconds /
> > seconds to catch instead of santiseconds of time.
> 
> If it is true for someone (for example, for me it is almost as readable
> as a simple size_t), then I agree with Geert who said: "use size_t 
> everywhere".

We already spent too much on bikeshedding here.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread Andrey Abramov



15.03.2019, 22:06, "Andy Shevchenko" :
> It makes harder to read and understand the code. One needs spend deciseconds /
> seconds to catch instead of santiseconds of time.

If it is true for someone (for example, for me it is almost as readable
as a simple size_t), then I agree with Geert who said: "use size_t everywhere".

With Best Regards, Andrey Abramov.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread Andy Shevchenko
On Fri, Mar 15, 2019 at 09:53:07PM +0300, Andrey Abramov wrote:
> > I'm trying to present the case to spur discussion, but it realy is
> > a *question* I'm asking about whether to do that, not a suggestion
> > phrased as a question.
> 
> > If it's just x86_64, use size_t everywhere, and let them suffer, for not
> > being real 64-bit ;-)
> 
> But what is the problem of local typedef with a good and clear comment?

It makes harder to read and understand the code. One needs spend deciseconds /
seconds to catch instead of santiseconds of time.

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread Andrey Abramov
> I'm trying to present the case to spur discussion, but it realy is
> a *question* I'm asking about whether to do that, not a suggestion
> phrased as a question.

> If it's just x86_64, use size_t everywhere, and let them suffer, for not
> being real 64-bit ;-)

But what is the problem of local typedef with a good and clear comment?


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread Geert Uytterhoeven
Hi George,

On Fri, Mar 15, 2019 at 5:59 PM George Spelvin  wrote:
> On Fri, 15 Mar 2019 at 13:57:05 +0100, Geert Uytterhoeven wrote:
> > On Fri, Mar 15, 2019 at 11:23 AM George Spelvin  wrote:
> >> On Fri, 15 Mar 2019 at 09:20:58 +0100, Geert Uytterhoeven wrote:
> >>> On Fri, Mar 15, 2019 at 5:33 AM George Spelvin  wrote:
>  One question I should ask everyone: should "count" be 32 or 64 bits
>  on 64-bit machines?  That would let x86 save a few REX bytes.  (815
>  vs. 813 byte code, if anyone cares.)
> 
>  Allegedy ARM can save a few pJ by gating the high 32
>  bits of the ALU.
> 
>  Most other 64-bit processors would prefer 64-bit operations as
>  it saves masking operations.
> >
> > So just make it unsigned int, unconditionally.
>
> As I wrote originally (and quoted above), other 64-bit machines don't
> have 32-bit operations and prefer 64-bit operations because they don't
> require masking.  x86 (for historical compatibiity) and ARM (for power
> saving) are the ones that come to mind.
>
> I'm trying to present the case to spur discussion, but it realy is
> a *question* I'm asking about whether to do that, not a suggestion
> phrased as a question.

If it's just x86_64, use size_t everywhere, and let them suffer, for not
being real 64-bit ;-)

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread George Spelvin
On Fri, 15 Mar 2019 at 13:57:05 +0100, Geert Uytterhoeven wrote:
> On Fri, Mar 15, 2019 at 11:23 AM George Spelvin  wrote:
>> On Fri, 15 Mar 2019 at 09:20:58 +0100, Geert Uytterhoeven wrote:
>>> On Fri, Mar 15, 2019 at 5:33 AM George Spelvin  wrote:
 One question I should ask everyone: should "count" be 32 or 64 bits
 on 64-bit machines?  That would let x86 save a few REX bytes.  (815
 vs. 813 byte code, if anyone cares.)

 Allegedy ARM can save a few pJ by gating the high 32
 bits of the ALU.

 Most other 64-bit processors would prefer 64-bit operations as
 it saves masking operations.
> 
> So just make it unsigned int, unconditionally.

As I wrote originally (and quoted above), other 64-bit machines don't
have 32-bit operations and prefer 64-bit operations because they don't
require masking.  x86 (for historical compatibiity) and ARM (for power
saving) are the ones that come to mind.

I'm trying to present the case to spur discussion, but it realy is
a *question* I'm asking about whether to do that, not a suggestion
phrased as a question.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread Geert Uytterhoeven
Hi George,

On Fri, Mar 15, 2019 at 11:23 AM George Spelvin  wrote:
> On Fri, 15 Mar 2019 at 09:20:58 +0100, Geert Uytterhoeven wrote:
> > On Fri, Mar 15, 2019 at 5:33 AM George Spelvin  wrote:
> >> On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
> >>> On Tue, Mar 05, 2019 at 03:06:44AM +, George Spelvin wrote:
>  +for (bit = 1; count & bit; bit <<= 1) {
>  +cur = merge(priv, (cmp_func)cmp, pending, cur);
>  +pending = pending->prev;  /* Untouched by merge() */
>   }
> >>>
> >>> Wouldn't be it the same to
> >>>
> >>>   bit = ffz(count);
> >>>   while (bit--) {
> >>>   ...
> >>>   }
> >>> ?
> >>>
> >>> Though I dunno which one is generating better code.
> >>
> >> One question I should ask everyone: should "count" be 32 or 64 bits
> >> on 64-bit machines?  That would let x86 save a few REX bytes.  (815
> >> vs. 813 byte code, if anyone cares.)
> >>
> >> Allegedy ARM can save a few pJ by gating the high 32
> >> bits of the ALU.
> >>
> >> Most other 64-bit processors would prefer 64-bit operations as
> >> it saves masking operations.
> >>
> >> If we never sort a list with more than 2^32 entries, it
> >> makes no difference.
> >>
> >> If we use a 32-bit count and we *do* sort a list with more than
> >> 2^32 entries, then it still sorts, but the performance degrades to
> >> O((n/2^32)^2).
> >>
> >> Just how often do we expect the kernel to face lists that long?
> >> (Note that the old code was O((n/2^20)^2).)
> >
> > Using size_t sounds most logical to me (argument of least surprise).
>
> Yes, it is the obvious solution, which is why that's my default choice.
>
> But a bit of thought shows that a list long enough to break a
> 32-bit implementation is beyond ridiculous.
>
> The list must be at least 3 * 2^32 elements long to make the sort
> merge non-optimally.  That's 1.5 * 2^37 bytes (192 GiB) of list_head
> structures alone; at least double that for any practical application.
> And 32 * 3 * 2^32 + (2 + 3) * 2^32 = 101 * 2^32 = 1.57 * 2^38
> compares.
>
> That seems like a lot but that's not the botteneck.  Each compare
> reads from a new list element, and pretty soon, they'll miss all
> caches and go to main memory.
>
> Since the memory locations are random, for any small subset of the
> list, you'll get only one element per cache line.  A 32 MiB L3
> cache is 2^19 cache lines (assuming 64B lines).  So merge levels
> 20 through 33 will go to main memory.
>
> That's (12 * 3 + 5) * 2^32 = 1.28 * 2^37 cache misses.  At 60 ns each (typical
> local DRAM access time on i7 & Xeon according to Intel), that's a
> hard minimum of 10565 seconds = 2h 56m 05s in one list_sort call.
>
> This is definitely the scale of problem where a mutithreaded sort is
> called for.
>
> It's *so* impossible that maybe it's worth trading that capability
> for a couple of bytes in the inner loop.
>
> >> In the code, I could do something like
> >>
> >> #ifdef CONFIG_X86_64
> >> /* Comment explaining why */
> >> typedef uint32_t count_t;
> >> #else
> >> typedef size_t count_t;
> >> #endif

So just make it unsigned int, unconditionally.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread George Spelvin
On Fri, 15 Mar 2019 at 09:20:58 +0100, Geert Uytterhoeven wrote:
> On Fri, Mar 15, 2019 at 5:33 AM George Spelvin  wrote:
>> On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
>>> On Tue, Mar 05, 2019 at 03:06:44AM +, George Spelvin wrote:
 +for (bit = 1; count & bit; bit <<= 1) {
 +cur = merge(priv, (cmp_func)cmp, pending, cur);
 +pending = pending->prev;  /* Untouched by merge() */
  }
>>>
>>> Wouldn't be it the same to
>>>
>>>   bit = ffz(count);
>>>   while (bit--) {
>>>   ...
>>>   }
>>> ?
>>>
>>> Though I dunno which one is generating better code.
>>
>> One question I should ask everyone: should "count" be 32 or 64 bits
>> on 64-bit machines?  That would let x86 save a few REX bytes.  (815
>> vs. 813 byte code, if anyone cares.)
>>
>> Allegedy ARM can save a few pJ by gating the high 32
>> bits of the ALU.
>>
>> Most other 64-bit processors would prefer 64-bit operations as
>> it saves masking operations.
>>
>> If we never sort a list with more than 2^32 entries, it
>> makes no difference.
>>
>> If we use a 32-bit count and we *do* sort a list with more than
>> 2^32 entries, then it still sorts, but the performance degrades to
>> O((n/2^32)^2).
>>
>> Just how often do we expect the kernel to face lists that long?
>> (Note that the old code was O((n/2^20)^2).)
>
> Using size_t sounds most logical to me (argument of least surprise).

Yes, it is the obvious solution, which is why that's my default choice.

But a bit of thought shows that a list long enough to break a
32-bit implementation is beyond ridiculous.

The list must be at least 3 * 2^32 elements long to make the sort
merge non-optimally.  That's 1.5 * 2^37 bytes (192 GiB) of list_head
structures alone; at least double that for any practical application.
And 32 * 3 * 2^32 + (2 + 3) * 2^32 = 101 * 2^32 = 1.57 * 2^38
compares.

That seems like a lot but that's not the botteneck.  Each compare
reads from a new list element, and pretty soon, they'll miss all
caches and go to main memory.

Since the memory locations are random, for any small subset of the
list, you'll get only one element per cache line.  A 32 MiB L3
cache is 2^19 cache lines (assuming 64B lines).  So merge levels
20 through 33 will go to main memory.

That's (12 * 3 + 5) * 2^32 = 1.28 * 2^37 cache misses.  At 60 ns each (typical
local DRAM access time on i7 & Xeon according to Intel), that's a
hard minimum of 10565 seconds = 2h 56m 05s in one list_sort call.

This is definitely the scale of problem where a mutithreaded sort is
called for.

It's *so* impossible that maybe it's worth trading that capability
for a couple of bytes in the inner loop.

>> In the code, I could do something like
>>
>> #ifdef CONFIG_X86_64
>> /* Comment explaining why */
>> typedef uint32_t count_t;
>> #else
>> typedef size_t count_t;
>> #endif
>>
>> ...
>> count_t count = 0;
>
> Using different types makes it more complex, e.g. to print the value
> in debug code.
> And adding more typedefs is frowned upon.

It's a *local* typedef, purely for the purpose of moving #ifdef clutter
out of the function declaration.  I agree that *global* typedefs are
discouraged.

As for debugging, that's a red herring; it's easy to cast to (size_t).

> Just my 0.02€.

Thank you for considering the issue!

>> I prefer the shorter _ints and _longs names, but this is just
>> not a hill I want to die on.
>
> Argument of least surprise: don't call something a duck if it's not
> guaranteed to behave like a duck.
>
> If I read "long", this triggers a warning flag in my head: be careful,
> this is 32-bit on 32-bit platforms, and 64-bit on 64-bit platforms.

Good point.  And "_longlong" or "_llong" is a bit ugly, too.


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-15 Thread Geert Uytterhoeven
Hi George,

On Fri, Mar 15, 2019 at 5:33 AM George Spelvin  wrote:
> On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
> > On Tue, Mar 05, 2019 at 03:06:44AM +, George Spelvin wrote:
> >> +for (bit = 1; count & bit; bit <<= 1) {
> >> +cur = merge(priv, (cmp_func)cmp, pending, cur);
> >> +pending = pending->prev;  /* Untouched by merge() */
> >>  }
> >
> > Wouldn't be it the same to
> >
> >   bit = ffz(count);
> >   while (bit--) {
> >   ...
> >   }
> > ?
> >
> > Though I dunno which one is generating better code.
>
> One question I should ask everyone: should "count" be 32 or 64 bits
> on 64-bit machines?  That would let x86 save a few REX bytes.  (815
> vs. 813 byte code, if anyone cares.)
>
> Allegedy ARM can save a few pJ by gating the high 32
> bits of the ALU.
>
> Most other 64-bit processors would prefer 64-bit operations as
> it saves masking operations.
>
> If we never sort a list with more than 2^32 entries, it
> makes no difference.
>
> If we use a 32-bit count and we *do* sort a list with more than
> 2^32 entries, then it still sorts, but the performance degrades to
> O((n/2^32)^2).
>
> Just how often do we expect the kernel to face lists that long?
> (Note that the old code was O((n/2^20)^2).)

Using size_t sounds most logical to me (argument of least surprise).

> In the code, I could do something like
>
> #ifdef CONFIG_X86_64
> /* Comment explaining why */
> typedef uint32_t count_t;
> #else
> typedef size_t count_t;
> #endif
>
> ...
> count_t count = 0;

Using different types makes it more complex, e.g. to print the value
in debug code.
And adding more typedefs is frowned upon.

Just my 0.02€.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-14 Thread George Spelvin
On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
> On Tue, Mar 05, 2019 at 03:06:44AM +, George Spelvin wrote:
>> +for (bit = 1; count & bit; bit <<= 1) {
>> +cur = merge(priv, (cmp_func)cmp, pending, cur);
>> +pending = pending->prev;  /* Untouched by merge() */
>>  }
>
> Wouldn't be it the same to
> 
>   bit = ffz(count);
>   while (bit--) {
>   ...
>   }
> ?
>
> Though I dunno which one is generating better code.

One question I should ask everyone: should "count" be 32 or 64 bits
on 64-bit machines?  That would let x86 save a few REX bytes.  (815
vs. 813 byte code, if anyone cares.)

Allegedy ARM can save a few pJ by gating the high 32
bits of the ALU.

Most other 64-bit processors would prefer 64-bit operations as
it saves masking operations.

If we never sort a list with more than 2^32 entries, it
makes no difference.

If we use a 32-bit count and we *do* sort a list with more than
2^32 entries, then it still sorts, but the performance degrades to
O((n/2^32)^2).

Just how often do we expect the kernel to face lists that long?
(Note that the old code was O((n/2^20)^2).)

In the code, I could do something like

#ifdef CONFIG_X86_64
/* Comment explaining why */
typedef uint32_t count_t;
#else
typedef size_t count_t;
#endif

...
count_t count = 0;


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-14 Thread George Spelvin
On Thu, 14 Mar 2019 at 11:10:41 +0200, Andy Shevchenko wrote:
> On Tue, Mar 05, 2019 at 03:06:44AM +, George Spelvin wrote:
>> +/* Do merges corresponding to set lsbits in count */
>
>> +for (bit = 1; count & bit; bit <<= 1) {
>> +cur = merge(priv, (cmp_func)cmp, pending, cur);
>> +pending = pending->prev;  /* Untouched by merge() */
>>  }
>
> Wouldn't be it the same to
>
>   bit = ffz(count);
>   while (bit--) {
>   ...
>   }
> ?
>
> Though I dunno which one is generating better code.

Yes, it's the same.  But count is an incrementing counter, so the
pattern of return values from ffz() is 01020103010201040102010301020105...,
which has mean value 1/2 + 1/4 + 1/8 + 1/16 +... = 1.

So spending one instruction on ffz() to save one instruction per loop
iteration is an even trade, and if the processor doesn't have an ffz()
instruction, it's a loss.

There's a third possible implementation:

>> +for (bit = count; bit & 1; bit >>= 1) {

...which works fine, too.  (It even saves a few bytes of code, so I
might switch to it.)  I used the form I did because my test code
verified that the length of the lists being merged equalled "bit".

The other forms don't have that property.


Thank you for looking at the code!


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-14 Thread Andy Shevchenko
On Tue, Mar 05, 2019 at 03:06:44AM +, George Spelvin wrote:
> Rather than a fixed-size array of pending sorted runs, use the ->prev
> links to keep track of things.  This reduces stack usage, eliminates
> some ugly overflow handling, and reduces the code size.
> 
> Also:
> * merge() no longer needs to handle NULL inputs, so simplify.
> * The same applies to merge_and_restore_back_links(), which is renamed
>   to the less ponderous merge_final().  (It's a static helper function,
>   so we don't need a super-descriptive name; comments will do.)
> 
> x86-64 code size 1086 -> 740 bytes (-346)

> + do {
> + size_t bit;
>   struct list_head *cur = list;
> +
> + /* Extract the head of "list" as a single-element list "cur" */
>   list = list->next;
>   cur->next = NULL;
>  
> + /* Do merges corresponding to set lsbits in count */

> + for (bit = 1; count & bit; bit <<= 1) {
> + cur = merge(priv, (cmp_func)cmp, pending, cur);
> + pending = pending->prev;  /* Untouched by merge() */
>   }

Wouldn't be it the same to

bit = ffz(count);
while (bit--) {
...
}
?

Though I dunno which one is generating better code.

> + /* And place the result at the head of "pending" */
> + cur->prev = pending;
> + pending = cur;
> + count++;
> + } while (list->next);
> +
> + /* Now merge together last element with all pending lists */
> + while (pending->prev) {
> + list = merge(priv, (cmp_func)cmp, pending, list);
> + pending = pending->prev;
>   }

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-10 Thread George Spelvin
Rasmus Villemoes wrote:
> On 05/03/2019 04.06, George Spelvin wrote:
>> + * (Actually, it is always called with @a being the element which was
>> + * originally first, so it is not necessary to to distinguish the @a < @b
>> + * and @a == @b cases; the return value may be a simple boolean.  But if
>> + * you ever *use* this freedom, be sure to update this comment to document
>> + * that code now depends on preserving this property!)
>
> This was and still is used at least by the block layer, and likely
> others as well. While 3110fc79606fb introduced a bunch of if() return -1
> else if () ... stuff, it still ends with a 0/1 result. Before
> 3110fc79606fb, it was even more obvious that this property was used.

Ah, thank you!  I actually read through every list_sort caller in
the kernel to see if I could find anywhere that used it and couldn't,
but I didn't study this code carefully enough to see that it does
in the last step.

Since someone *does* use this, I'll change the comment signiicantly.

> Grepping around shows that this could probably be used in more places,
> gaining a cycle or two per cmp callback, e.g. xfs_buf_cmp. But that's of
> course outside the scope of this series.

The one that misled me at first was _xfs_buf_obj_cmp, which returns 0/1,
but that's not used by list_sort().  xfs_buf_cmp returns -1/0/+1.

As you might see from the comment around the cmp_func typedef,
there are other things that could be cleaned up if we did a pass
over all the call sites.

(I'm almost tempted to tell the compiler than cmp_func is const,
since it's supposed to be independent of the pointer frobbing that
list_sort does, but then I remember Henry Spencer's maxim about
lying to the compiler.)


Re: [PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-10 Thread Rasmus Villemoes
On 05/03/2019 04.06, George Spelvin wrote:

>   * The comparison function @cmp must return a negative value if @a
>   * should sort before @b, and a positive value if @a should sort after
>   * @b. If @a and @b are equivalent, and their original relative
>   * ordering is to be preserved, @cmp must return 0.
> + *
> + * (Actually, it is always called with @a being the element which was
> + * originally first, so it is not necessary to to distinguish the @a < @b
> + * and @a == @b cases; the return value may be a simple boolean.  But if
> + * you ever *use* this freedom, be sure to update this comment to document
> + * that code now depends on preserving this property!)
>   */

This was and still is used at least by the block layer, and likely
others as well. While 3110fc79606fb introduced a bunch of if() return -1
else if () ... stuff, it still ends with a 0/1 result. Before
3110fc79606fb, it was even more obvious that this property was used. So
I agree that it is worth documenting this feature, both for users of
list_sort, but even more so for future refactorers of it - but you
probably want to change the wording somewhat.

Grepping around shows that this could probably be used in more places,
gaining a cycle or two per cmp callback, e.g. xfs_buf_cmp. But that's of
course outside the scope of this series.

Rasmus


[PATCH 4/5] lib/list_sort: Simplify and remove MAX_LIST_LENGTH_BITS

2019-03-08 Thread George Spelvin
Rather than a fixed-size array of pending sorted runs, use the ->prev
links to keep track of things.  This reduces stack usage, eliminates
some ugly overflow handling, and reduces the code size.

Also:
* merge() no longer needs to handle NULL inputs, so simplify.
* The same applies to merge_and_restore_back_links(), which is renamed
  to the less ponderous merge_final().  (It's a static helper function,
  so we don't need a super-descriptive name; comments will do.)

x86-64 code size 1086 -> 740 bytes (-346)

(Yes, I see checkpatch complaining about no space after comma in
"__attribute__((nonnull(2,3,4,5)))".  Checkpatch is wrong.)

Signed-off-by: George Spelvin 
---
 include/linux/list_sort.h |   1 +
 lib/list_sort.c   | 152 --
 2 files changed, 96 insertions(+), 57 deletions(-)

diff --git a/include/linux/list_sort.h b/include/linux/list_sort.h
index ba79956e848d..20f178c24e9d 100644
--- a/include/linux/list_sort.h
+++ b/include/linux/list_sort.h
@@ -6,6 +6,7 @@
 
 struct list_head;
 
+__attribute__((nonnull(2,3)))
 void list_sort(void *priv, struct list_head *head,
   int (*cmp)(void *priv, struct list_head *a,
  struct list_head *b));
diff --git a/lib/list_sort.c b/lib/list_sort.c
index 85759928215b..e4819ef0426b 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -7,33 +7,47 @@
 #include 
 #include 
 
-#define MAX_LIST_LENGTH_BITS 20
+/*
+ * By declaring the compare function with the __pure attribute, we give
+ * the compiler more opportunity to optimize.  Ideally, we'd use this in
+ * the prototype of list_sort(), but that would involve a lot of churn
+ * at all call sites, so just cast the function pointer passed in.
+ */
+typedef int __pure __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+   struct list_head const *, struct list_head const *);
 
 /*
  * Returns a list organized in an intermediate format suited
  * to chaining of merge() calls: null-terminated, no reserved or
  * sentinel head node, "prev" links not maintained.
  */
-static struct list_head *merge(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
+__attribute__((nonnull(2,3,4)))
+static struct list_head *merge(void *priv, cmp_func cmp,
struct list_head *a, struct list_head *b)
 {
-   struct list_head head, *tail = 
+   struct list_head *head, **tail = 
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
-   tail->next = a;
+   if (cmp(priv, a, b) <= 0) {
+   *tail = a;
+   tail = >next;
a = a->next;
+   if (!a) {
+   *tail = b;
+   break;
+   }
} else {
-   tail->next = b;
+   *tail = b;
+   tail = >next;
b = b->next;
+   if (!b) {
+   *tail = a;
+   break;
+   }
}
-   tail = tail->next;
}
-   tail->next = a?:b;
-   return head.next;
+   return head;
 }
 
 /*
@@ -43,44 +57,52 @@ static struct list_head *merge(void *priv,
  * prev-link restoration pass, or maintaining the prev links
  * throughout.
  */
-static void merge_and_restore_back_links(void *priv,
-   int (*cmp)(void *priv, struct list_head *a,
-   struct list_head *b),
-   struct list_head *head,
-   struct list_head *a, struct list_head *b)
+__attribute__((nonnull(2,3,4,5)))
+static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
+   struct list_head *a, struct list_head *b)
 {
struct list_head *tail = head;
u8 count = 0;
 
-   while (a && b) {
+   for (;;) {
/* if equal, take 'a' -- important for sort stability */
-   if ((*cmp)(priv, a, b) <= 0) {
+   if (cmp(priv, a, b) <= 0) {
tail->next = a;
a->prev = tail;
+   tail = a;
a = a->next;
+   if (!a)
+   break;
} else {
tail->next = b;
b->prev = tail;
+   tail = b;
b = b->next;
+   if (!b) {
+   b = a;
+   break;
+   }
}
-   tail = tail->next;
}
-