Re: [PATCH] lz4: fix performance regressions

2017-02-14 Thread Sven Schmidt
Hey Eric,

On Sun, Feb 12, 2017 at 03:38:02PM -0800, Eric Biggers wrote:
> Hi Sven,
> 
> On Sun, Feb 12, 2017 at 12:16:18PM +0100, Sven Schmidt wrote:
> >  /*-************************************
> >   * Reading and writing into memory
> >   **************************************/
> > +typedef union {
> > +   U16 u16;
> > +   U32 u32;
> > +   size_t uArch;
> > +} __packed unalign;
> > 
> > -static inline U16 LZ4_read16(const void *memPtr)
> > +static FORCE_INLINE __maybe_unused U16 LZ4_read16(const void *ptr)
> >  {
> > -   U16 val;
> > -
> > -   memcpy(&val, memPtr, sizeof(val));
> > -
> > -   return val;
> > +   return ((const unalign *)ptr)->u16;
> >  }
> > 
> > -static inline U32 LZ4_read32(const void *memPtr)
> > +static FORCE_INLINE __maybe_unused U32 LZ4_read32(const void *ptr)
> >  {
> > -   U32 val;
> > -
> > -   memcpy(&val, memPtr, sizeof(val));
> > -
> > -   return val;
> > +   return ((const unalign *)ptr)->u32;
> >  }
> > 
> > -static inline size_t LZ4_read_ARCH(const void *memPtr)
> > +static FORCE_INLINE __maybe_unused size_t LZ4_read_ARCH(const void *ptr)
> >  {
> > -   size_t val;
> > -
> > -   memcpy(&val, memPtr, sizeof(val));
> > -
> > -   return val;
> > +   return ((const unalign *)ptr)->uArch;
> >  }
> > 
> > -static inline void LZ4_write16(void *memPtr, U16 value)
> > +static FORCE_INLINE __maybe_unused void LZ4_write16(void *memPtr, U16 value)
> >  {
> > -   memcpy(memPtr, &value, sizeof(value));
> > +   ((unalign *)memPtr)->u16 = value;
> >  }
> > 
> > -static inline void LZ4_write32(void *memPtr, U32 value)
> > -{
> > -   memcpy(memPtr, &value, sizeof(value));
> > +static FORCE_INLINE __maybe_unused void LZ4_write32(void *memPtr, U32 value) {
> > +   ((unalign *)memPtr)->u32 = value;
> >  }
> > 
> > -static inline U16 LZ4_readLE16(const void *memPtr)
> > +static FORCE_INLINE __maybe_unused U16 LZ4_readLE16(const void *memPtr)
> >  {
> > -#ifdef __LITTLE_ENDIAN__
> > +#if LZ4_LITTLE_ENDIAN
> > return LZ4_read16(memPtr);
> >  #else
> > const BYTE *p = (const BYTE *)memPtr;
> > @@ -137,19 +143,19 @@ static inline U16 LZ4_readLE16(const void *memPtr)
> >  #endif
> >  }
> 
> Since upstream LZ4 is intended to be compiled at -O3, this may allow it to
> get away with using memcpy() for unaligned memory accesses.  The reason it
> uses memcpy() is that, other than a byte-by-byte copy, it is the only
> portable way to express unaligned memory accesses.  But the Linux kernel is
> sometimes compiled optimized for size (-Os), and I wouldn't be *too*
> surprised if some of the memcpy()'s don't always get inlined then, which
> could be causing the performance regression being observed.  (Of course,
> this could be verified by checking whether CONFIG_CC_OPTIMIZE_FOR_SIZE=y is
> set, then reading the assembly.)
> 
> But I don't think accessing a __packed structure directly is the right
> alternative.  Instead, Linux already includes macros for unaligned memory
> accesses which have been optimized for every supported architecture.  Those
> should just be used instead, e.g. like this:
> 
> static FORCE_INLINE U16 LZ4_read16(const void *ptr)
> {
>   return get_unaligned((const u16 *)ptr);
> }
> 
> static FORCE_INLINE U32 LZ4_read32(const void *ptr)
> {
>   return get_unaligned((const u32 *)ptr);
> }
> 
> static FORCE_INLINE size_t LZ4_read_ARCH(const void *ptr)
> {
>   return get_unaligned((const size_t *)ptr);
> }
> 
> static FORCE_INLINE void LZ4_write16(void *memPtr, U16 value)
> {
>   put_unaligned(value, (u16 *)memPtr);
> }
> 
> static FORCE_INLINE void LZ4_write32(void *memPtr, U32 value)
> {
>   put_unaligned(value, (u32 *)memPtr);
> }
> 
> static FORCE_INLINE U16 LZ4_readLE16(const void *memPtr)
> {
>   return get_unaligned_le16(memPtr);
> }
> 
> static FORCE_INLINE void LZ4_writeLE16(void *memPtr, U16 value)
> {
> >   put_unaligned_le16(value, memPtr);
> }
> 
> static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
> {
>   if (LZ4_64bits()) {
>   u64 a = get_unaligned((const u64 *)src);
>   put_unaligned(a, (u64 *)dst);
>   } else {
>   u32 a = get_unaligned((const u32 *)src);
>   u32 b = get_unaligned((const u32 *)src + 1);
>   put_unaligned(a, (u32 *)dst);
>   put_unaligned(b, (u32 *)dst + 1);
>   }
> }
> 
> 
> Note that I dropped __maybe_unused as it's not needed on inline functions.
> That should be done everywhere else the patch proposes to add it too.
> 
> > -#if LZ4_ARCH64
> > -#ifdef __BIG_ENDIAN__
> > -#define LZ4_NBCOMMONBYTES(val) (__builtin_clzll(val) >> 3)
> > +static FORCE_INLINE unsigned int LZ4_NbCommonBytes(register size_t val)
> > +{
> > +#if LZ4_LITTLE_ENDIAN
> > +#if LZ4_ARCH64 /* 64 Bits Little Endian */
> > +#if defined(LZ4_FORCE_SW_BITCOUNT)
> > +   static const int DeBruijnBytePos[64] = {
> > +   0, 0, 0, 0, 0, 1, 1, 2, 0, 3, 1, 3, 1, 4, 2, 7,
> > +   0, 2, 3, 6, 1, 5, 3, 5, 1, 3, 4, 4, 2, 5, 6, 7,
> > +   7, 0, 1, 2, 3, 

Re: [PATCH] lz4: fix performance regressions

2017-02-13 Thread Willy Tarreau
On Mon, Feb 13, 2017 at 12:53:49PM +0100, Sven Schmidt wrote:
> On Sun, Feb 12, 2017 at 10:41:17PM +0100, Willy Tarreau wrote:
> > On Sun, Feb 12, 2017 at 04:20:00PM +0100, Sven Schmidt wrote:
> > > On Sun, Feb 12, 2017 at 02:05:08PM +0100, Willy Tarreau wrote:
> > > > Hi Sven,
> > > > 
> > > > On Sun, Feb 12, 2017 at 12:16:18PM +0100, Sven Schmidt wrote:
> > > > > Fix performance regressions compared to current kernel LZ4
> > > > 
> > > > Your patch contains mostly style cleanups, which are certainly welcome
> > > > but make the whole patch hard to review. These cleanups would have been
> > > > better placed in a separate, preliminary patch, IMHO.
> > > > 
> > > > Regards,
> > > > Willy
> > > 
> > > Hi Willy,
> > > 
> > > The problem was that I wanted to compare my version to the upstream LZ4
> > > to find bugs (as with my last patch version: wrong indentation in two
> > > for loops in LZ4HC). But since the LZ4 code is a pain to read, I made
> > > additional style cleanups along the way.
> > 
> > Oh I can easily understand!
> > 
> > > I hope you can manage to review the patch though, because it is
> > > difficult to separate the cleanups now.
> > 
> > When I need to split a patch into pieces, usually what I do is revert it,
> > re-apply it without committing, then "git add -p" and validate all the
> > hunks to be taken as the first patch (i.e., here, the cleanups), commit,
> > then commit the rest as a separate one. It seems to me that the fix is in
> > the last few hunks, though I'm not sure yet.
> > 
> > Thanks,
> > Willy
> 
> Hi Willy,
> 
> I didn't know about this 'trick' until now. Thanks for sharing it! I gave it
> a quick try recently; it's really cool!
> 
> Since the problem discussed in this branch of the thread seems to be solved
> (see Minchan's e-mail), I won't split the patches, though. Or is there an
> actual need for doing so? I will send an updated patchset (containing these
> patches plus the other ones suggested by Eric) later.

It's probably too late for this time, but keep it in mind for next time :-)

willy


Re: [PATCH] lz4: fix performance regressions

2017-02-13 Thread Sven Schmidt
On Sun, Feb 12, 2017 at 10:41:17PM +0100, Willy Tarreau wrote:
> On Sun, Feb 12, 2017 at 04:20:00PM +0100, Sven Schmidt wrote:
> > On Sun, Feb 12, 2017 at 02:05:08PM +0100, Willy Tarreau wrote:
> > > Hi Sven,
> > > 
> > > On Sun, Feb 12, 2017 at 12:16:18PM +0100, Sven Schmidt wrote:
> > > > Fix performance regressions compared to current kernel LZ4
> > > 
> > > Your patch contains mostly style cleanups, which are certainly welcome
> > > but make the whole patch hard to review. These cleanups would have been
> > > better placed in a separate, preliminary patch, IMHO.
> > > 
> > > Regards,
> > > Willy
> > 
> > Hi Willy,
> > 
> > The problem was that I wanted to compare my version to the upstream LZ4
> > to find bugs (as with my last patch version: wrong indentation in two
> > for loops in LZ4HC). But since the LZ4 code is a pain to read, I made
> > additional style cleanups along the way.
> 
> Oh I can easily understand!
> 
> > I hope you can manage to review the patch though, because it is difficult
> > to separate the cleanups now.
> 
> When I need to split a patch into pieces, usually what I do is revert it,
> re-apply it without committing, then "git add -p" and validate all the
> hunks to be taken as the first patch (i.e., here, the cleanups), commit,
> then commit the rest as a separate one. It seems to me that the fix is in
> the last few hunks, though I'm not sure yet.
> 
> Thanks,
> Willy

Hi Willy,

I didn't know about this 'trick' until now. Thanks for sharing it! I gave it
a quick try recently; it's really cool!

Since the problem discussed in this branch of the thread seems to be solved
(see Minchan's e-mail), I won't split the patches, though. Or is there an
actual need for doing so? I will send an updated patchset (containing these
patches plus the other ones suggested by Eric) later.

Regards,

Sven


Re: [PATCH] lz4: fix performance regressions

2017-02-12 Thread Eric Biggers
Hi Sven,

On Sun, Feb 12, 2017 at 12:16:18PM +0100, Sven Schmidt wrote:
>  /*-************************************
>   *   Reading and writing into memory
>   **************************************/
> +typedef union {
> + U16 u16;
> + U32 u32;
> + size_t uArch;
> +} __packed unalign;
> 
> -static inline U16 LZ4_read16(const void *memPtr)
> +static FORCE_INLINE __maybe_unused U16 LZ4_read16(const void *ptr)
>  {
> - U16 val;
> -
> - memcpy(&val, memPtr, sizeof(val));
> -
> - return val;
> + return ((const unalign *)ptr)->u16;
>  }
> 
> -static inline U32 LZ4_read32(const void *memPtr)
> +static FORCE_INLINE __maybe_unused U32 LZ4_read32(const void *ptr)
>  {
> - U32 val;
> -
> - memcpy(&val, memPtr, sizeof(val));
> -
> - return val;
> + return ((const unalign *)ptr)->u32;
>  }
> 
> -static inline size_t LZ4_read_ARCH(const void *memPtr)
> +static FORCE_INLINE __maybe_unused size_t LZ4_read_ARCH(const void *ptr)
>  {
> - size_t val;
> -
> - memcpy(&val, memPtr, sizeof(val));
> -
> - return val;
> + return ((const unalign *)ptr)->uArch;
>  }
> 
> -static inline void LZ4_write16(void *memPtr, U16 value)
> +static FORCE_INLINE __maybe_unused void LZ4_write16(void *memPtr, U16 value)
>  {
> - memcpy(memPtr, &value, sizeof(value));
> + ((unalign *)memPtr)->u16 = value;
>  }
> 
> -static inline void LZ4_write32(void *memPtr, U32 value)
> -{
> - memcpy(memPtr, &value, sizeof(value));
> +static FORCE_INLINE __maybe_unused void LZ4_write32(void *memPtr, U32 value) {
> + ((unalign *)memPtr)->u32 = value;
>  }
> 
> -static inline U16 LZ4_readLE16(const void *memPtr)
> +static FORCE_INLINE __maybe_unused U16 LZ4_readLE16(const void *memPtr)
>  {
> -#ifdef __LITTLE_ENDIAN__
> +#if LZ4_LITTLE_ENDIAN
>   return LZ4_read16(memPtr);
>  #else
>   const BYTE *p = (const BYTE *)memPtr;
> @@ -137,19 +143,19 @@ static inline U16 LZ4_readLE16(const void *memPtr)
>  #endif
>  }

Since upstream LZ4 is intended to be compiled at -O3, this may allow it to get
away with using memcpy() for unaligned memory accesses.  The reason it uses
memcpy() is that, other than a byte-by-byte copy, it is the only portable way to
express unaligned memory accesses.  But the Linux kernel is sometimes compiled
optimized for size (-Os), and I wouldn't be *too* surprised if some of the
memcpy()'s don't always get inlined then, which could be causing the performance
regression being observed.  (Of course, this could be verified by checking
whether CONFIG_CC_OPTIMIZE_FOR_SIZE=y is set, then reading the assembly.)
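
In isolation, the idiom in question looks like this; at -O2/-O3 compilers
typically collapse the memcpy() into a single (possibly unaligned) load,
while at -Os that is less certain.  A userspace sketch, not kernel code:

#include <stdint.h>
#include <string.h>

/* Portable unaligned read: no alignment assumption on p; the compiler is
 * free to lower the memcpy() to one load where the target allows it. */
static inline uint32_t read32_portable(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}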

But I don't think accessing a __packed structure directly is the right
alternative.  Instead, Linux already includes macros for unaligned memory
accesses which have been optimized for every supported architecture.  Those
should just be used instead, e.g. like this:

static FORCE_INLINE U16 LZ4_read16(const void *ptr)
{
	return get_unaligned((const u16 *)ptr);
}

static FORCE_INLINE U32 LZ4_read32(const void *ptr)
{
	return get_unaligned((const u32 *)ptr);
}

static FORCE_INLINE size_t LZ4_read_ARCH(const void *ptr)
{
	return get_unaligned((const size_t *)ptr);
}

static FORCE_INLINE void LZ4_write16(void *memPtr, U16 value)
{
	put_unaligned(value, (u16 *)memPtr);
}

static FORCE_INLINE void LZ4_write32(void *memPtr, U32 value)
{
	put_unaligned(value, (u32 *)memPtr);
}

static FORCE_INLINE U16 LZ4_readLE16(const void *memPtr)
{
	return get_unaligned_le16(memPtr);
}

static FORCE_INLINE void LZ4_writeLE16(void *memPtr, U16 value)
{
	put_unaligned_le16(value, memPtr);
}

static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
{
	if (LZ4_64bits()) {
		u64 a = get_unaligned((const u64 *)src);
		put_unaligned(a, (u64 *)dst);
	} else {
		u32 a = get_unaligned((const u32 *)src);
		u32 b = get_unaligned((const u32 *)src + 1);
		put_unaligned(a, (u32 *)dst);
		put_unaligned(b, (u32 *)dst + 1);
	}
}


Note that I dropped __maybe_unused as it's not needed on inline functions.
That should be done everywhere else the patch proposes to add it too.
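
A quick way to see that, assuming gcc with -Wall (a standalone sketch, not
part of the patch):

/* cc -Wall -c test.c: the unused plain static function triggers
 * -Wunused-function; the unused static inline one does not. */
static inline int unused_inline(void) { return 1; }
static int unused_plain(void) { return 2; }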

> -#if LZ4_ARCH64
> -#ifdef __BIG_ENDIAN__
> -#define LZ4_NBCOMMONBYTES(val) (__builtin_clzll(val) >> 3)
> +static FORCE_INLINE unsigned int LZ4_NbCommonBytes(register size_t val)
> +{
> +#if LZ4_LITTLE_ENDIAN
> +#if LZ4_ARCH64 /* 64 Bits Little Endian */
> +#if defined(LZ4_FORCE_SW_BITCOUNT)
> + static const int DeBruijnBytePos[64] = {
> + 0, 0, 0, 0, 0, 1, 1, 2, 0, 3, 1, 3, 1, 4, 2, 7,
> + 0, 2, 3, 6, 1, 5, 3, 5, 1, 3, 4, 4, 2, 5, 6, 7,
> + 7, 0, 1, 2, 3, 3, 4, 6, 2, 6, 5, 5, 3, 4, 5, 6,
> + 7, 1, 2, 4, 6, 4, 4, 5, 7, 2, 6, 5, 7, 6, 7, 7
> + };
> +
> + return DeBruijnBytePos[((U64)((val & -(long long)val)
> + * 0x0218A392CDABBD3FULL)) >> 58];
>  #else
> -#define LZ4_NBCOMMONBYTES(val) (__builtin_ctzll(val) >> 3)
> 
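
For reference, the De Bruijn software fallback quoted above can be
sanity-checked in userspace against the builtin it replaces; a minimal
sketch, assuming gcc or clang for __builtin_ctzll():

#include <stdio.h>
#include <stdint.h>

static const int DeBruijnBytePos[64] = {
	0, 0, 0, 0, 0, 1, 1, 2, 0, 3, 1, 3, 1, 4, 2, 7,
	0, 2, 3, 6, 1, 5, 3, 5, 1, 3, 4, 4, 2, 5, 6, 7,
	7, 0, 1, 2, 3, 3, 4, 6, 2, 6, 5, 5, 3, 4, 5, 6,
	7, 1, 2, 4, 6, 4, 4, 5, 7, 2, 6, 5, 7, 6, 7, 7
};

/* val & -val isolates the lowest set bit; the De Bruijn multiply-and-shift
 * maps each of the 64 possible isolated bits to a unique 6-bit index, and
 * the table converts that index to the byte position of the bit. */
static unsigned int nb_common_bytes_sw(uint64_t val)
{
	return DeBruijnBytePos[((val & -val) * 0x0218A392CDABBD3FULL) >> 58];
}

int main(void)
{
	uint64_t v;

	for (v = 1; v < 1000000; v++)
		if (nb_common_bytes_sw(v) != (unsigned int)(__builtin_ctzll(v) >> 3))
			printf("mismatch at %llu\n", (unsigned long long)v);
	return 0;
}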

Re: [PATCH] lz4: fix performance regressions

2017-02-12 Thread Willy Tarreau
On Sun, Feb 12, 2017 at 04:20:00PM +0100, Sven Schmidt wrote:
> On Sun, Feb 12, 2017 at 02:05:08PM +0100, Willy Tarreau wrote:
> > Hi Sven,
> > 
> > On Sun, Feb 12, 2017 at 12:16:18PM +0100, Sven Schmidt wrote:
> > > Fix performance regressions compared to current kernel LZ4
> > 
> > Your patch contains mostly style cleanups, which are certainly welcome
> > but make the whole patch hard to review. These cleanups would have been
> > better placed in a separate, preliminary patch, IMHO.
> > 
> > Regards,
> > Willy
> 
> Hi Willy,
> 
> The problem was that I wanted to compare my version to the upstream LZ4
> to find bugs (as with my last patch version: wrong indentation in two
> for loops in LZ4HC). But since the LZ4 code is a pain to read, I made
> additional style cleanups along the way.

Oh I can easily understand!

> I hope you can manage to review the patch though, because it is difficult
> to separate the cleanups now.

When I need to split a patch into pieces, usually what I do is revert it,
re-apply it without committing, then "git add -p" and validate all the
hunks to be taken as the first patch (i.e., here, the cleanups), commit,
then commit the rest as a separate one. It seems to me that the fix is in
the last few hunks, though I'm not sure yet.
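
In commands, that sequence is roughly the following; <big-patch> stands in
for the commit in question, and "git cherry-pick -n" is one way to re-apply
it without committing:

git revert <big-patch>            # take the combined patch back out
git cherry-pick -n <big-patch>    # re-apply its changes, uncommitted
git add -p                        # stage only the cleanup hunks
git commit                        # first patch: the cleanups
git commit -a                     # second patch: the actual fix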

Thanks,
Willy


Re: [PATCH] lz4: fix performance regressions

2017-02-12 Thread Sven Schmidt
On Sun, Feb 12, 2017 at 02:05:08PM +0100, Willy Tarreau wrote:
> Hi Sven,
> 
> On Sun, Feb 12, 2017 at 12:16:18PM +0100, Sven Schmidt wrote:
> > Fix performance regressions compared to current kernel LZ4
> 
> Your patch contains mostly style cleanups, which are certainly welcome
> but make the whole patch hard to review. These cleanups would have been
> better placed in a separate, preliminary patch, IMHO.
> 
> Regards,
> Willy

Hi Willy,

The problem was that I wanted to compare my version to the upstream LZ4 to
find bugs (as with my last patch version: wrong indentation in two for
loops in LZ4HC). But since the LZ4 code is a pain to read, I made
additional style cleanups along the way.
I hope you can manage to review the patch though, because it is difficult to
separate the cleanups now.
Please feel free to ask if you stumble upon something.

Greetings,

Sven 


Re: [PATCH] lz4: fix performance regressions

2017-02-12 Thread Willy Tarreau
Hi Sven,

On Sun, Feb 12, 2017 at 12:16:18PM +0100, Sven Schmidt wrote:
> Fix performance regressions compared to current kernel LZ4

Your patch contains mostly style cleanups, which are certainly welcome
but make the whole patch hard to review. These cleanups would have been
better placed in a separate, preliminary patch, IMHO.

Regards,
Willy