Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-22 Thread John Naylor
On Wed, Jul 21, 2021 at 8:08 PM Thomas Munro  wrote:
>
> On Thu, Jul 22, 2021 at 6:16 AM John Naylor

> One question is whether this "one size fits all" approach will be
> extensible to wider SIMD.

Sure, it'll just take a little more work and complexity. For one, 16-byte
SIMD can operate on 32-byte chunks with a bit of repetition:

-   __m128i input;
+   __m128i input1;
+   __m128i input2;

-#define SIMD_STRIDE_LENGTH (sizeof(__m128i))
+#define SIMD_STRIDE_LENGTH 32

while (len >= SIMD_STRIDE_LENGTH)
{
-   input = vload(s);
+   input1 = vload(s);
+   input2 = vload(s + sizeof(input1));

-   check_for_zeros(input, &error);
+   check_for_zeros(input1, &error);
+   check_for_zeros(input2, &error);

/*
 * If the chunk is all ASCII, we can skip the full UTF-8 check, but we
@@ -460,17 +463,18 @@ pg_validate_utf8_sse42(const unsigned char *s, int len)
 * sequences at the end. We only update prev_incomplete if the chunk
 * contains non-ASCII, since the error is cumulative.
 */
-   if (is_highbit_set(input))
+   if (is_highbit_set(bitwise_or(input1, input2)))
{
-   check_utf8_bytes(prev, input, &error);
-   prev_incomplete = is_incomplete(input);
+   check_utf8_bytes(prev, input1, &error);
+   check_utf8_bytes(input1, input2, &error);
+   prev_incomplete = is_incomplete(input2);
}
else
{
error = bitwise_or(error, prev_incomplete);
}

-   prev = input;
+   prev = input2;
s += SIMD_STRIDE_LENGTH;
len -= SIMD_STRIDE_LENGTH;
}

So with a few #ifdefs, we can accommodate two sizes if we like.

For another, the prevN() functions would need to change, at least on x86 --
that would require replacing _mm_alignr_epi8() with _mm256_alignr_epi8()
plus _mm256_permute2x128_si256(). Also, we might have to do something with
the vector typedef.
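
For illustration, a 256-bit prev1() might look something like this (untested
sketch following simdjson's AVX2 helpers -- an assumption on my part, not
code from the patch):

#include <immintrin.h>

static inline __m256i
prev1_avx2(__m256i prev, __m256i input)
{
	/* low lane = high lane of prev, high lane = low lane of input */
	__m256i		shuffled = _mm256_permute2x128_si256(prev, input, 0x21);

	/*
	 * Within each 128-bit lane, shift in the byte just before it, giving
	 * the 32-byte window moved right by one byte overall.
	 */
	return _mm256_alignr_epi8(input, shuffled, 16 - 1);
}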

That said, I think we can punt on that until we have an application that's
much more compute-intensive. As it is with SSE4, COPY FROM WHERE  already pushes the utf8 validation way down in profiles.

> FWIW here are some performance results from my humble RPI4:
>
> master:
>
>  chinese | mixed | ascii
> -+---+---
> 4172 |  2763 |  1823
> (1 row)
>
> Your v15 patch:
>
>  chinese | mixed | ascii
> -+---+---
> 2267 |  1248 |   399
> (1 row)
>
> Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:
>
>  chinese | mixed | ascii
> -+---+---
>  909 |   620 |   318
> (1 row)
>
> It's so good I wonder if it's producing incorrect results :-)

Nice! If it passes regression tests, it *should* be fine, but stress
testing would be welcome on any platform.

> I also tried to do a quick and dirty AltiVec patch to see if it could
> fit into the same code "shape", with less immediate success: it works
> out slower than the fallback code on the POWER7 machine I scrounged an
> account on.  I'm not sure what's wrong there, but maybe it's a useful
> start (I'm probably confused about endianness, or the encoding of
> boolean vectors which may be different (is true 0x01 or 0xff, does it
> matter?), or something else, and it's falling back on errors all the
> time?).

Hmm, I have access to a power8 machine to play with, but I also don't mind
having some type of server-class hardware that relies on the recent nifty
DFA fallback, which performs even better on powerpc64le than v15.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-21 Thread Thomas Munro
On Thu, Jul 22, 2021 at 6:16 AM John Naylor
 wrote:
> Neat! It's good to make it more architecture-agnostic, and I'm sure we can 
> use quite a bit of this.

One question is whether this "one size fits all" approach will be
extensible to wider SIMD.

>  to_bool(const pg_u8x16_t v)
>  {
> +#if defined(USE_NEON)
> + return vmaxvq_u32((uint32x4_t) v) != 0;
>
> --> return vmaxvq_u8(*this) != 0;

I chose that lane width because I saw an unsubstantiated claim
somewhere that it might be faster, but I have no idea if it matters.
The u8 code looks more natural anyway.  Changed.

>  vzero()
>  {
> +#if defined(USE_NEON)
> + return vmovq_n_u8(0);
>
> --> return vdupq_n_u8(0); // or equivalently, splat(0)

I guess it doesn't make a difference which builtin you use here, but I
was influenced by the ARM manual which says the vdupq form is
generated for immediate values.

> is_highbit_set(const pg_u8x16_t v)
>  {
> +#if defined(USE_NEON)
> + return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));
>
> --> return vmaxvq_u8(v) > 0x7F

Ah, of course.  Much nicer!

> +#if defined(USE_NEON)
> +static pg_attribute_always_inline pg_u8x16_t
> +vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
> + uint8 v4, uint8 v5, uint8 v6, uint8 v7,
> + uint8 v8, uint8 v9, uint8 v10, uint8 v11,
> + uint8 v12, uint8 v13, uint8 v14, uint8 v15)
> +{
> + uint8 pg_attribute_aligned(16) values[16] = {
> + v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
> + };
> + return vld1q_u8(values);
> +}
>
> --> They have this strange beast instead:
>
>   // Doing a load like so end ups generating worse code.
>   // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
>   // x9, x10,x11,x12,x13,x14,x15,x16};
>   // return vld1q_u8(array);
>   uint8x16_t x{};
>   // incredibly, Visual Studio does not allow x[0] = x1
>   x = vsetq_lane_u8(x1, x, 0);
>   x = vsetq_lane_u8(x2, x, 1);
>   x = vsetq_lane_u8(x3, x, 2);
> ...
>   x = vsetq_lane_u8(x15, x, 14);
>   x = vsetq_lane_u8(x16, x, 15);
>   return x;
>
> Since you aligned the array, that might not have the problem alluded to 
> above, and it looks nicer.

Strange indeed.  We should probably poke around in the assembler and
see... it might be that MSVC doesn't like it, and I was just
cargo-culting the alignment.  I don't expect the generated code to
really "load" anything of course, it should ideally be some kind of
immediate mov...

FWIW here are some performance results from my humble RPI4:

master:

 chinese | mixed | ascii
-+---+---
4172 |  2763 |  1823
(1 row)

Your v15 patch:

 chinese | mixed | ascii
-+---+---
2267 |  1248 |   399
(1 row)

Your v15 patch set + the NEON patch, configured with USE_UTF8_SIMD=1:

 chinese | mixed | ascii
-+---+---
 909 |   620 |   318
(1 row)

It's so good I wonder if it's producing incorrect results :-)

I also tried to do a quick and dirty AltiVec patch to see if it could
fit into the same code "shape", with less immediate success: it works
out slower than the fallback code on the POWER7 machine I scrounged an
account on.  I'm not sure what's wrong there, but maybe it's a useful
start (I'm probably confused about endianness, or the encoding of
boolean vectors which may be different (is true 0x01 or 0xff, does it
matter?), or something else, and it's falling back on errors all the
time?).
From 1464c22da33117900341496d03b92dec5be2a62a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 22 Jul 2021 02:05:06 +1200
Subject: [PATCH v2 1/3] XXX Make SIMD code more platform neutral.

Move SIMD code into pg_utf_simd.c to experiment with the idea of a
shared implementation across architectures.  Introduce pg_u8x16_t to
abstract vector type.

XXX Experiment grade code only
---
 configure |  18 +--
 configure.ac  |  18 +--
 src/include/pg_config.h.in|  14 +-
 src/include/port/pg_utf8.h|  10 +-
 src/port/Makefile |   8 +-
 ...g_utf8_sse42_choose.c => pg_utf8_choose.c} |  10 +-
 src/port/{pg_utf8_sse42.c => pg_utf8_simd.c}  | 148 +-
 7 files changed, 114 insertions(+), 112 deletions(-)
 rename src/port/{pg_utf8_sse42_choose.c => pg_utf8_choose.c} (88%)
 rename src/port/{pg_utf8_sse42.c => pg_utf8_simd.c} (76%)

diff --git a/configure b/configure
index 30969840b1..df546b641c 100755
--- a/configure
+++ b/configure
@@ -18442,13 +18442,13 @@ fi
 #
 # You can override this logic by setting the appropriate USE_*_UTF8 flag to 1
 # in the template or configure command line.
-if test x"$USE_SSE42_UTF8" = x"" && test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" 
= x"" && test x"$USE_FALLBACK_UTF8" = x""; then
+if test x"$USE_SIMD_UTF8" = x"" && test x"$USE_SIMD_UTF8_WITH_RUNTIME_CHECK" = 
x"" && test x"$USE_FALLBACK_UTF8" = x""; then
   if test x"$pgac_sse42_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = 
x"1" ; then
-USE_SSE42_UTF8=1
+USE_SIMD_UTF8=1
  

Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-21 Thread John Naylor
On Wed, Jul 21, 2021 at 11:29 AM Thomas Munro 
wrote:

> Just for fun/experimentation, here's a quick (and probably too naive)
> translation of those helper functions to NEON, on top of the v15
> patch.

Neat! It's good to make it more architecture-agnostic, and I'm sure we can
use quite a bit of this. I don't know enough about NEON to comment
intelligently, but a quick glance through the simdjson source show a couple
differences that might be worth a look:

 to_bool(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return vmaxvq_u32((uint32x4_t) v) != 0;

--> return vmaxvq_u8(*this) != 0;

 vzero()
 {
+#if defined(USE_NEON)
+ return vmovq_n_u8(0);

--> return vdupq_n_u8(0); // or equivalently, splat(0)

is_highbit_set(const pg_u8x16_t v)
 {
+#if defined(USE_NEON)
+ return to_bool(bitwise_and(v, vmovq_n_u8(0x80)));

--> return vmaxvq_u8(v) > 0x7F

(Technically, their convention is: is_ascii(v) { return vmaxvq_u8(v) < 0x80; },
but same effect)

+#if defined(USE_NEON)
+static pg_attribute_always_inline pg_u8x16_t
+vset(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
+ uint8 v4, uint8 v5, uint8 v6, uint8 v7,
+ uint8 v8, uint8 v9, uint8 v10, uint8 v11,
+ uint8 v12, uint8 v13, uint8 v14, uint8 v15)
+{
+ uint8 pg_attribute_aligned(16) values[16] = {
+ v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15
+ };
+ return vld1q_u8(values);
+}

--> They have this strange beast instead:

  // Doing a load like so end ups generating worse code.
  // uint8_t array[16] = {x1, x2, x3, x4, x5, x6, x7, x8,
  // x9, x10,x11,x12,x13,x14,x15,x16};
  // return vld1q_u8(array);
  uint8x16_t x{};
  // incredibly, Visual Studio does not allow x[0] = x1
  x = vsetq_lane_u8(x1, x, 0);
  x = vsetq_lane_u8(x2, x, 1);
  x = vsetq_lane_u8(x3, x, 2);
...
  x = vsetq_lane_u8(x15, x, 14);
  x = vsetq_lane_u8(x16, x, 15);
  return x;

Since you aligned the array, that might not have the problem alluded to
above, and it looks nicer.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-07-21 Thread Thomas Munro
On Sat, Mar 13, 2021 at 4:37 AM John Naylor
 wrote:
> On Fri, Mar 12, 2021 at 9:14 AM Amit Khandekar  wrote:
> > I was not thinking about auto-vectorizing the code in
> > pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
> > inside the individual helper functions that you wrote, such as
> > _mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),
>
> If the PhD holders who came up with this algorithm thought it possible to do 
> it that way, I'm sure they would have. In reality, simdjson has different 
> files for SSE4, AVX, AVX512, NEON, and Altivec. We can incorporate any of 
> those as needed. That's a PG15 project, though, and I'm not volunteering.

Just for fun/experimentation, here's a quick (and probably too naive)
translation of those helper functions to NEON, on top of the v15
patch.
From 1464c22da33117900341496d03b92dec5be2a62a Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 22 Jul 2021 02:05:06 +1200
Subject: [PATCH 1/2] XXX Make SIMD code more platform neutral.

Move SIMD code into pg_utf_simd.c to experiment with the idea of a
shared implementation across architectures.  Introduce pg_u8x16_t to
abstract vector type.

XXX Experiment grade code only
---
 configure |  18 +--
 configure.ac  |  18 +--
 src/include/pg_config.h.in|  14 +-
 src/include/port/pg_utf8.h|  10 +-
 src/port/Makefile |   8 +-
 ...g_utf8_sse42_choose.c => pg_utf8_choose.c} |  10 +-
 src/port/{pg_utf8_sse42.c => pg_utf8_simd.c}  | 148 +-
 7 files changed, 114 insertions(+), 112 deletions(-)
 rename src/port/{pg_utf8_sse42_choose.c => pg_utf8_choose.c} (88%)
 rename src/port/{pg_utf8_sse42.c => pg_utf8_simd.c} (76%)

diff --git a/configure b/configure
index 30969840b1..df546b641c 100755
--- a/configure
+++ b/configure
@@ -18442,13 +18442,13 @@ fi
 #
 # You can override this logic by setting the appropriate USE_*_UTF8 flag to 1
 # in the template or configure command line.
-if test x"$USE_SSE42_UTF8" = x"" && test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" 
= x"" && test x"$USE_FALLBACK_UTF8" = x""; then
+if test x"$USE_SIMD_UTF8" = x"" && test x"$USE_SIMD_UTF8_WITH_RUNTIME_CHECK" = 
x"" && test x"$USE_FALLBACK_UTF8" = x""; then
   if test x"$pgac_sse42_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = 
x"1" ; then
-USE_SSE42_UTF8=1
+USE_SIMD_UTF8=1
   else
 # the CPUID instruction is needed for the runtime check.
 if test x"$pgac_sse42_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" 
= x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-  USE_SSE42_UTF8_WITH_RUNTIME_CHECK=1
+  USE_SIMD_UTF8_WITH_RUNTIME_CHECK=1
 else
   # fall back to algorithm which doesn't require any special
   # CPU support.
@@ -18461,19 +18461,19 @@ fi
 # Note: We need the fallback for error handling in all builds.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking which UTF-8 validator to 
use" >&5
 $as_echo_n "checking which UTF-8 validator to use... " >&6; }
-if test x"$USE_SSE42_UTF8" = x"1"; then
+if test x"$USE_SIMD_UTF8" = x"1"; then
 
-$as_echo "#define USE_SSE42_UTF8 1" >>confdefs.h
+$as_echo "#define USE_SIMD_UTF8 1" >>confdefs.h
 
-  PG_UTF8_OBJS="pg_utf8_sse42.o pg_utf8_fallback.o"
+  PG_UTF8_OBJS="pg_utf8_simd.o pg_utf8_fallback.o"
   { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2" >&5
 $as_echo "SSE 4.2" >&6; }
 else
-  if test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" = x"1"; then
+  if test x"$USE_SIMD_UTF8_WITH_RUNTIME_CHECK" = x"1"; then
 
-$as_echo "#define USE_SSE42_UTF8_WITH_RUNTIME_CHECK 1" >>confdefs.h
+$as_echo "#define USE_SIMD_UTF8_WITH_RUNTIME_CHECK 1" >>confdefs.h
 
-PG_UTF8_OBJS="pg_utf8_sse42.o pg_utf8_fallback.o pg_utf8_sse42_choose.o"
+PG_UTF8_OBJS="pg_utf8_simd.o pg_utf8_fallback.o pg_utf8_choose.o"
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: SSE 4.2 with runtime 
check" >&5
 $as_echo "SSE 4.2 with runtime check" >&6; }
   else
diff --git a/configure.ac b/configure.ac
index 5e2b4717c1..1606a80fb7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2217,13 +2217,13 @@ AC_SUBST(PG_CRC32C_OBJS)
 #
 # You can override this logic by setting the appropriate USE_*_UTF8 flag to 1
 # in the template or configure command line.
-if test x"$USE_SSE42_UTF8" = x"" && test x"$USE_SSE42_UTF8_WITH_RUNTIME_CHECK" 
= x"" && test x"$USE_FALLBACK_UTF8" = x""; then
+if test x"$USE_SIMD_UTF8" = x"" && test x"$USE_SIMD_UTF8_WITH_RUNTIME_CHECK" = 
x"" && test x"$USE_FALLBACK_UTF8" = x""; then
   if test x"$pgac_sse42_intrinsics" = x"yes" && test x"$SSE4_2_TARGETED" = 
x"1" ; then
-USE_SSE42_UTF8=1
+USE_SIMD_UTF8=1
   else
 # the CPUID instruction is needed for the runtime check.
 if test x"$pgac_sse42_intrinsics" = x"yes" && (test x"$pgac_cv__get_cpuid" 
= x"yes" || test x"$pgac_cv__cpuid" = x"yes"); then
-  USE_SSE42_UTF8_WITH_RUNTIME_CHECK=1
+  USE_SIMD_UTF8_WITH_RUNTIME_CHECK=1
  

Re: [POC] verifying UTF-8 using SIMD instructions

2021-04-01 Thread John Naylor
v9 is just a rebase.

--
John Naylor
EDB: http://www.enterprisedb.com
From e876049ad3b153e8725ab23f65ae8f021a970470 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 1 Apr 2021 08:24:05 -0400
Subject: [PATCH v9] Replace pg_utf8_verifystr() with two faster
 implementations:

On x86-64, we use SSE 4.1 with a lookup algorithm based on "Validating
UTF-8 In Less Than One Instruction Per Byte" by John Keiser and Daniel
Lemire. Since configure already tests for SSE 4.2 for CRC, we piggy-back
on top of that. The lookup tables are taken from the simdjson library
(Apache 2.0 licensed), but the code is written from scratch using
simdjson as a reference.

On other platforms, we still get a performance boost by using a bespoke
fallback function, rather than one that relies on pg_utf8_verifychar()
and pg_utf8_isvalid(). This one is loosely based on the fallback that
is part of the simdjson library.
---
 config/c-compiler.m4 |  28 +-
 configure| 114 +++--
 configure.ac |  61 ++-
 src/Makefile.global.in   |   3 +
 src/common/wchar.c   |  27 +-
 src/include/pg_config.h.in   |   9 +
 src/include/port/pg_utf8.h   |  86 
 src/port/Makefile|   6 +
 src/port/pg_utf8_fallback.c  | 129 ++
 src/port/pg_utf8_sse42.c | 537 +++
 src/port/pg_utf8_sse42_choose.c  |  68 +++
 src/test/regress/expected/conversion.out |  52 +++
 src/test/regress/sql/conversion.sql  |  28 ++
 src/tools/msvc/Mkvcbuild.pm  |   4 +
 src/tools/msvc/Solution.pm   |   3 +
 15 files changed, 1088 insertions(+), 67 deletions(-)
 create mode 100644 src/include/port/pg_utf8.h
 create mode 100644 src/port/pg_utf8_fallback.c
 create mode 100644 src/port/pg_utf8_sse42.c
 create mode 100644 src/port/pg_utf8_sse42_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 780e906ecc..b1604eac58 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -591,36 +591,46 @@ if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then
   AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int64 *, int64).])
 fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS
 
-# PGAC_SSE42_CRC32_INTRINSICS
+# PGAC_SSE42_INTRINSICS
 # ---
 # Check if the compiler supports the x86 CRC instructions added in SSE 4.2,
 # using the _mm_crc32_u8 and _mm_crc32_u32 intrinsic functions. (We don't
 # test the 8-byte variant, _mm_crc32_u64, but it is assumed to be present if
 # the other ones are, on x86-64 platforms)
 #
+# While at it, check for support x86 instructions added in SSSE3 and SSE4.1,
+# in particular _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128.
+# We should be able to assume these are understood by the compiler if CRC
+# intrinsics are, but it's better to document our reliance on them here.
+#
+# We don't test for SSE2 intrinsics, as they are assumed to be present on
+# x86-64 platforms, which we can easily check at compile time.
+#
 # An optional compiler flag can be passed as argument (e.g. -msse4.2). If the
-# intrinsics are supported, sets pgac_sse42_crc32_intrinsics, and CFLAGS_SSE42.
-AC_DEFUN([PGAC_SSE42_CRC32_INTRINSICS],
-[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_crc32_intrinsics_$1])])dnl
-AC_CACHE_CHECK([for _mm_crc32_u8 and _mm_crc32_u32 with CFLAGS=$1], [Ac_cachevar],
+# intrinsics are supported, sets pgac_sse42_intrinsics, and CFLAGS_SSE42.
+AC_DEFUN([PGAC_SSE42_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sse42_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for for _mm_crc32_u8, _mm_crc32_u32, _mm_alignr_epi8, _mm_shuffle_epi8, and _mm_testz_si128 with CFLAGS=$1], [Ac_cachevar],
 [pgac_save_CFLAGS=$CFLAGS
 CFLAGS="$pgac_save_CFLAGS $1"
AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <nmmintrin.h>],
   [unsigned int crc = 0;
crc = _mm_crc32_u8(crc, 0);
crc = _mm_crc32_u32(crc, 0);
+   __m128i vec = _mm_set1_epi8(crc);
+   vec = _mm_shuffle_epi8(vec,
+ _mm_alignr_epi8(vec, vec, 1));
/* return computed value, to prevent the above being optimized away */
-   return crc == 0;])],
+   return _mm_testz_si128(vec, vec);])],
   [Ac_cachevar=yes],
   [Ac_cachevar=no])
 CFLAGS="$pgac_save_CFLAGS"])
 if test x"$Ac_cachevar" = x"yes"; then
   CFLAGS_SSE42="$1"
-  pgac_sse42_crc32_intrinsics=yes
+  pgac_sse42_intrinsics=yes
 fi
 undefine([Ac_cachevar])dnl
-])# PGAC_SSE42_CRC32_INTRINSICS
-
+])# PGAC_SSE42_INTRINSICS
 
 # PGAC_ARMV8_CRC32C_INTRINSICS
 # 
diff --git a/configure b/configure
index 06ad9aeb71..4d70f10fab 100755
--- a/configure
+++ b/configure
@@ -645,6 +645,7 @@ XGETTEXT
 MSGMERGE
 MSGFMT_FLAGS
 MSGFMT
+PG_UTF8_OBJS
 PG_CRC32C_OBJS
 CFLAGS_ARMV8_CRC32C
 CFLAGS_SSE42
@@ -17963,14 +17964,14 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
 
 fi
 
-# Check for Intel SSE 4.2 intrinsics to do CRC 

Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-12 Thread John Naylor
On Fri, Mar 12, 2021 at 9:14 AM Amit Khandekar 
wrote:
>
> On my Arm64 VM :
>
> HEAD :
>  mixed | ascii
> ---+---
>   1091 |   628
> (1 row)
>
> PATCHED :
>  mixed | ascii
> ---+---
>681 |   119

Thanks for testing! Good, the speedup is about as much as I can hope for
using plain C. In the next patch I'll go ahead and squash in the ascii fast
path, using 16-byte stride, unless there are objections. I claim we can
live with the regression Heikki found on an old 32-bit Arm platform since
it doesn't seem to be true of Arm in general.

> I guess, if at all we use the equivalent Arm NEON intrinsics, the
> "mixed" figures will be close to the "ascii" figures, going by your
> figures on x86.

I would assume so.

> I was not thinking about auto-vectorizing the code in
> pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
> inside the individual helper functions that you wrote, such as
> _mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),

If the PhD holders who came up with this algorithm thought it possible to
do it that way, I'm sure they would have. In reality, simdjson has
different files for SSE4, AVX, AVX512, NEON, and Altivec. We can
incorporate any of those as needed. That's a PG15 project, though, and I'm
not volunteering.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-12 Thread Amit Khandekar
On Tue, 9 Mar 2021 at 17:14, John Naylor  wrote:
> On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar  wrote:
> > Just a quick question before I move on to review the patch ... The
> > improvement looks like it is only meant for x86 platforms.
>
> Actually it's meant to be faster for all platforms, since the C fallback is 
> quite a bit different from HEAD. I've found it to be faster on ppc64le. An 
> earlier version of the patch was a loser on 32-bit Arm because of alignment 
> issues, but if you could run the test script attached to [1] on 64-bit Arm, 
> I'd be curious to see how it does on 0002, and whether 0003 and 0004 make 
> things better or worse. If there is trouble building on non-x86 platforms, 
> I'd want to fix that also.

On my Arm64 VM :

HEAD :
 mixed | ascii
---+---
  1091 |   628
(1 row)

PATCHED :
 mixed | ascii
---+---
   681 |   119

So the fallback function does show improvements on Arm64.

I guess, if at all we use the equivalent Arm NEON intrinsics, the
"mixed" figures will be close to the "ascii" figures, going by your
figures on x86.

> > Can this be
> > done in a portable way by arranging for auto-vectorization ? Something
> > like commit 88709176236caf. This way it would benefit other platforms
> > as well.
>
> I'm fairly certain that the author of a compiler capable of doing that in 
> this case would be eligible for some kind of AI prize. :-)

:)

I was not thinking about auto-vectorizing the code in
pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
inside the individual helper functions that you wrote, such as
_mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),
saturating_sub() etc. I myself am not sure whether it is feasible to
write code that auto-vectorizes all these function definitions.
saturating_sub() seems hard, but I could see the gcc docs mentioning
support for generating such instructions for a particular code loop.
But for the index lookup function, it seems impossible to generate
the needed lookup intrinsics. We can have platform-specific
function definitions for such exceptional cases.

I am considering this only because that would make the exact code work
on other platforms like arm64 and ppc, and we wouldn't need
platform-specific files. But I understand that it is easier said than
done. We will have to process the loop in pg_validate_utf8_sse42() in
128-bit chunks, and pass each chunk to individual functions, which
could mean extra work and extra copy in extracting the chunk data and
passing it around, which may make things drastically slow. You are
passing around the chunks using __m128i type, so perhaps it means
passing around just a reference to the simd registers. Not sure.


-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-09 Thread John Naylor
On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar 
wrote:
>
> Hi,
>
> Just a quick question before I move on to review the patch ... The
> improvement looks like it is only meant for x86 platforms.

Actually it's meant to be faster for all platforms, since the C fallback is
quite a bit different from HEAD. I've found it to be faster on ppc64le. An
earlier version of the patch was a loser on 32-bit Arm because of alignment
issues, but if you could run the test script attached to [1] on 64-bit Arm,
I'd be curious to see how it does on 0002, and whether 0003 and 0004 make
things better or worse. If there is trouble building on non-x86 platforms,
I'd want to fix that also.

(Note: 0001 is not my patch, and I just include it for the tests)

> Can this be
> done in a portable way by arranging for auto-vectorization ? Something
> like commit 88709176236caf. This way it would benefit other platforms
> as well.

I'm fairly certain that the author of a compiler capable of doing that in
this case would be eligible for some kind of AI prize. :-)

[1]
https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5...@iki.fi
--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-09 Thread Amit Khandekar
Hi,

Just a quick question before I move on to review the patch ... The
improvement looks like it is only meant for x86 platforms. Can this be
done in a portable way by arranging for auto-vectorization ? Something
like commit 88709176236caf. This way it would benefit other platforms
as well.

I tried to compile the following code using -O3, and the assembly does
have vectorized instructions.

#include <stdio.h>
int main()
{
int i;
char s1[200] = "abcdewhruerhetr";
char s2[200] = "oweurietiureuhtrethre";
char s3[200] = {0};

for (i = 0; i < sizeof(s1); i++)
{
s3[i] = s1[i] ^ s2[i];
}

printf("%s\n", s3);
}




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-18 Thread John Naylor
On Mon, Feb 15, 2021 at 9:32 PM John Naylor 
wrote:
>
> On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas 
wrote:
> >
> > I'm guessing that's because the unaligned access in check_ascii() is
> > expensive on this platform.

> Some possible remedies:

> 3) #ifdef out the ascii check for 32-bit platforms.

> 4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time.
I'll probably try this first.

I've attached a couple patches to try on top of v4; maybe they'll help the
Arm32 regression. 01 reduces the stride to 8 bytes, and 02 applies on top
of v1 to disable the fallback fast path entirely on 32-bit platforms. A bit
of a heavy hammer, but it'll confirm (or not) your theory about unaligned
loads.

Also, I've included patches to explain more fully how I modeled non-UTF-8
performance while still using the UTF-8 tests. I think it was a useful
thing to do, and I have a theory that might predict how a non-UTF8 encoding
will perform with the fast path.

03A and 03B are independent of each other and conflict, but both apply on
top of v4 (don't need 02). Both replace the v4 fallback with the ascii
fastpath + pg_utf8_verifychar() in the loop, similar to utf-8 on master.
03A has a local static copy of pg_utf8_islegal(), and 03B uses the existing
global function. (On x86, you can disable SSE4 by passing
USE_FALLBACK_UTF8=1 to configure.)

While Clang 10 regressed for me on pure multibyte in a similar test
upthread, on Linux gcc 8.4 there isn't a regression at all. IIRC, gcc
wasn't as good as Clang when the API changed a few weeks ago, so its
regression from v4 is still faster than master. Clang only regressed with
my changes because it somehow handled master much better to begin with.

x86-64 Linux gcc 8.4

master

 chinese | mixed | ascii
-+---+---
1453 |   857 |   428

v4 (fallback verifier written as a single function)

 chinese | mixed | ascii
-+---+---
 815 |   514 |82

v4 plus addendum 03A -- emulate non-utf-8 using a copy of
pg_utf8_islegal() as a static function

 chinese | mixed | ascii
-+---+---
1115 |   547 |87

v4 plus addendum 03B -- emulate non-utf-8 using pg_utf8_islegal() as a
global function

 chinese | mixed | ascii
-+---+---
1279 |   604 |82

(I also tried the same on ppc64le Linux, gcc 4.8.5 and while not great, it
never got worse than master either on pure multibyte.)

This is supposed to model the performance of a non-utf8 encoding, where we
don't have a bespoke function written from scratch. Here's my theory: If an
encoding has pg_*_mblen(), a global function, inside pg_*_verifychar(), it
seems it won't benefit as much from an ascii fast path as one whose
pg_*_verifychar() has no function calls. I'm not sure whether a compiler
can inline a global function's body into call sites in the unit where it's
defined. (I haven't looked at the assembly.) But recall that you didn't
commit 0002 from the earlier encoding change, because it wasn't performing.
I looked at that patch again, and while it inlined the pg_utf8_verifychar()
call, it still called the global function pg_utf8_islegal().

If the above is anything to go by, on gcc at least, I don't think we need
to worry about a regression when adding an ascii fast path to non-utf-8
multibyte encodings.

Regarding SSE, I've added an ascii fast path in my local branch, but it's
not going to be as big a difference because 1) the check is more expensive
in terms of branches than the C case, and 2) because the general case is so
fast already, it's hard to improve upon. I just need to do some testing and
cleanup on the whole thing, and that'll be ready to share.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/src/include/port/pg_utf8.h b/src/include/port/pg_utf8.h
index dac2afc130..a0c94dd4f3 100644
--- a/src/include/port/pg_utf8.h
+++ b/src/include/port/pg_utf8.h
@@ -53,26 +53,25 @@ extern int pg_validate_utf8_fallback(const unsigned char *s, int len);
 static inline int
 check_ascii(const unsigned char *s, int len)
 {
-	uint64		half1, half2,
+	uint64		chunk,
 highbit_mask;
 
-	if  (len >= 2 * sizeof(uint64))
+	if  (len >= sizeof(uint64))
 	{
-		memcpy(&half1, s, sizeof(uint64));
-		memcpy(&half2, s + sizeof(uint64), sizeof(uint64));
+		memcpy(&chunk, s, sizeof(uint64));
 
 		/*
 		 * If there are any zero bytes, bail and let the slow
 		 * path handle it.
 		 */
-		if (HAS_ZERO(half1) || HAS_ZERO(half2))
+		if (HAS_ZERO(chunk))
 			return 0;
 
 		/* Check if any bytes in this chunk have the high bit set. */
-		highbit_mask = ((half1 | half2) & UINT64CONST(0x8080808080808080));
+		highbit_mask = (chunk & UINT64CONST(0x8080808080808080));
 
 		if (!highbit_mask)
-			return 2 * sizeof(uint64);
+			return sizeof(uint64);
 		else
 			return 0;
 	}
diff --git a/src/include/port/pg_utf8.h b/src/include/port/pg_utf8.h
index a0c94dd4f3..62272a7c93 100644
--- a/src/include/port/pg_utf8.h
+++ b/src/include/port/pg_utf8.h
@@ -53,6 +53,7 @@ extern int 

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-16 Thread John Naylor
I wrote:

> [v3]
> - It's not smart enough to stop at the last valid character boundary --
it's either all-valid or it must start over with the fallback. That will
have to change in order to work with the proposed noError conversions. It
shouldn't be very hard, but needs thought as to the clearest and safest way
to code it.

In v4, it should be able to return an accurate count of valid bytes even
when the end crosses a character boundary.

> - This is my first time hacking autoconf, and it still seems slightly
broken, yet functional on my machine at least.

It was actually completely broken if you tried to pass the special flags to
configure. I redesigned this part and it seems to work now.

--
John Naylor
EDB: http://www.enterprisedb.com


v4-SSE4-with-autoconf-support.patch
Description: Binary data


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-15 Thread John Naylor
On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas  wrote:
>

Attached is the first attempt at using SSE4 to do the validation, but first
I'll answer your questions about the fallback.

I should mention that v2 had a correctness bug for 4-byte characters that I
found when I was writing regression tests. It shouldn't materially affect
performance, however.

> I thought the "chinese" numbers above are pure multibyte input, and it
> seems to do well on that. Where does it regress? In multibyte encodings
> other than UTF-8?

Yes, the second set of measurements was intended to represent multibyte
encodings other than UTF-8. But instead of using one of those encodings, I
simulated non-UTF-8 by copying the pattern used for those: in the loop,
check for ascii then either advance or verify one character. It was a quick
way to use the same test.

> How bad is the regression?

I'll copy the measurements here together with master so it's easier to
compare:

~= PG13 (revert 6c5576075b0f9 and b80e10638e3):

 chinese | mixed | ascii
-+---+---
1565 |   848 |   365

master:

 chinese | mixed | ascii
-+---+---
1081 |   761 |   366

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
-+---+---
1279 |   656 |94

As I mentioned upthread, pure multibyte is still faster than PG13. Reducing
the ascii check to 8 bytes at a time might alleviate the regression.

> I tested this on my first generation Raspberry Pi (chipmunk). I had to
> tweak it a bit to make it compile, since the SSE autodetection code was
> not finished yet. And I used generate_series(1, 1000) instead of
> generate_series(1, 1) in the test script (mbverifystr-speed.sql)
> because this system is so slow.
>
> master:
>
>   mixed | ascii
> ---+---
>1310 |  1041
> (1 row)
>
> v2-add-portability-stub-and-new-fallback.patch:
>
>   mixed | ascii
> ---+---
>2979 |   910
> (1 row)
>
> I'm guessing that's because the unaligned access in check_ascii() is
> expensive on this platform.

Hmm, I used memcpy() as suggested. Is that still slow on that platform?
That's 32-bit, right? Some possible remedies:

1) For the COPY FROM case, we should align the allocation on a cacheline --
we already have examples of that idiom elsewhere. I was actually going to
suggest doing this anyway, since unaligned SIMD loads are often slower, too.

2) As the simdjson fallback was based on Fuchsia (the Lemire paper implies
it was tested carefully on Arm and I have no reason to doubt that), I could
try to follow that example more faithfully by computing the actual
codepoints. It's more computation and just as many branches as far as I can
tell, but it's not a lot of work. I can add that alternative fallback to
the patch set. I have no Arm machines, but I can test on a POWER8 machine.

3) #ifdef out the ascii check for 32-bit platforms.

4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time.
I'll probably try this first.

Now, I'm pleased to report that I got SSE4 working, and it seems to work.
It still needs some stress testing to find any corner case bugs, but it
shouldn't be too early to share some numbers on Clang 10 / MacOS:

master:

 chinese | mixed | ascii
-+---+---
1082 |   751 |   364

v3 with SSE4.1:

 chinese | mixed | ascii
-+---+---
 127 |   128 |   126

Some caveats and notes:

- It takes almost no recognizable code from simdjson, but it does take the
magic constants lookup tables almost verbatim. The main body of the code
has no intrinsics at all (I think). They're all hidden inside static inline
helper functions. I reused some cryptic variable names from simdjson. It's
a bit messy but not terrible.

- It diffs against the noError conversion patch and adds additional tests.

- It's not smart enough to stop at the last valid character boundary --
it's either all-valid or it must start over with the fallback. That will
have to change in order to work with the proposed noError conversions. It
shouldn't be very hard, but needs thought as to the clearest and safest way
to code it.

- There is no ascii fast-path yet. With this algorithm we have to be a bit
more careful since a valid ascii chunk could be preceded by an incomplete
sequence at the end of the previous chunk. Not too hard, just a bit more
work.

- This is my first time hacking autoconf, and it still seems slightly
broken, yet functional on my machine at least.

- It only needs SSE4.1, but I didn't want to create a whole new CFLAGS, so
it just reuses SSE4.2 for the runtime check and the macro names. Also, it
doesn't test for SSE2, it just insists on 64-bit for the runtime check. I
imagine it would refuse to build on 32-bit machines if you passed it -msse42

- There is a placeholder for Windows support, but it's not developed.

- I had to add a large number of casts to get rid of warnings in the magic
constants macros. That needs some polish.

I also attached a C file that 

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-15 Thread Heikki Linnakangas

On 13/02/2021 03:31, John Naylor wrote:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas > wrote:

 >
 > I also tested the fallback implementation from the simdjson library
 > (included in the patch, if you uncomment it in simdjson-glue.c):
 >
 >   mixed | ascii
 > ---+---
 >     447 |    46
 > (1 row)
 >
 > I think we should at least try to adopt that. At a high level, it looks
 > pretty similar your patch: you load the data 8 bytes at a time, check if
 > there are all ASCII. If there are any non-ASCII chars, you check the
 > bytes one by one, otherwise you load the next 8 bytes. Your patch should
 > be able to achieve the same performance, if done right. I don't think
 > the simdjson code forbids \0 bytes, so that will add a few cycles, but
 > still.

Attached is a patch that does roughly what simdjson fallback did, except 
I use straight tests on the bytes and only calculate code points in 
assertion builds. In the course of doing this, I found that my earlier 
concerns about putting the ascii check in a static inline function were 
due to my suboptimal loop implementation. I had assumed that if the 
chunked ascii check failed, it had to check all those bytes one at a 
time. As it turns out, that's a waste of the branch predictor. In the v2 
patch, we do the chunked ascii check every time we loop. With that, I 
can also confirm the claim in the Lemire paper that it's better to do 
the check on 16-byte chunks:


(MacOS, Clang 10)

master:

  chinese | mixed | ascii
-+---+---
     1081 |   761 |   366

v2 patch, with 16-byte stride:

  chinese | mixed | ascii
-+---+---
      806 |   474 |    83

patch but with 8-byte stride:

  chinese | mixed | ascii
-+---+---
      792 |   490 |   105

I also included the fast path in all other multibyte encodings, and that 
is also pretty good performance-wise.


Cool.

It regresses from master on pure 
multibyte input, but that case is still faster than PG13, which I 
simulated by reverting 6c5576075b0f9 and b80e10638e3:


I thought the "chinese" numbers above are pure multibyte input, and it 
seems to do well on that. Where does it regress? In multibyte encodings 
other than UTF-8? How bad is the regression?


I tested this on my first generation Raspberry Pi (chipmunk). I had to 
tweak it a bit to make it compile, since the SSE autodetection code was 
not finished yet. And I used generate_series(1, 1000) instead of 
generate_series(1, 1) in the test script (mbverifystr-speed.sql) 
because this system is so slow.


master:

 mixed | ascii
---+---
  1310 |  1041
(1 row)

v2-add-portability-stub-and-new-fallback.patch:

 mixed | ascii
---+---
  2979 |   910
(1 row)

I'm guessing that's because the unaligned access in check_ascii() is 
expensive on this platform.


- Heikki




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-12 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas  wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> ---+---
> 447 |46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Attached is a patch that does roughly what simdjson fallback did, except I
use straight tests on the bytes and only calculate code points in assertion
builds. In the course of doing this, I found that my earlier concerns about
putting the ascii check in a static inline function were due to my
suboptimal loop implementation. I had assumed that if the chunked ascii
check failed, it had to check all those bytes one at a time. As it turns
out, that's a waste of the branch predictor. In the v2 patch, we do the
chunked ascii check every time we loop. With that, I can also confirm the
claim in the Lemire paper that it's better to do the check on 16-byte
chunks:

(MacOS, Clang 10)

master:

 chinese | mixed | ascii
-+---+---
1081 |   761 |   366

v2 patch, with 16-byte stride:

 chinese | mixed | ascii
-+---+---
 806 |   474 |83

patch but with 8-byte stride:

 chinese | mixed | ascii
-+---+---
 792 |   490 |   105

I also included the fast path in all other multibyte encodings, and that is
also pretty good performance-wise. It regresses from master on pure
multibyte input, but that case is still faster than PG13, which I simulated
by reverting 6c5576075b0f9 and b80e10638e3:

~PG13:

 chinese | mixed | ascii
-+---+---
1565 |   848 |   365

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
-+---+---
1279 |   656 |94


v2 has a rough start to having multiple implementations in
src/backend/port. Next steps are:

1. Add more tests for utf-8 coverage (in addition to the ones to be added
by the noError argument patch)
2. Add SSE4 validator -- it turns out the demo I referred to earlier
doesn't match the algorithm in the paper. I plan to only copy the lookup
tables from simdjson verbatim, but the code will basically be written from
scratch, using  simdjson as a hint.
3. Adjust configure.ac

--
John Naylor
EDB: http://www.enterprisedb.com


v2-add-portability-stub-and-new-fallback.patch
Description: Binary data


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
On Tue, Feb 9, 2021 at 4:22 PM Heikki Linnakangas  wrote:
>
> On 09/02/2021 22:08, John Naylor wrote:
> > Maybe there's a smarter way to check for zeros in C. Or maybe be more
> > careful about cache -- running memchr() on the whole input first might
> > not be the best thing to do.
>
> The usual trick is the haszero() macro here:
> https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's
> how memchr() is typically implemented, too.

Thanks for that. Checking with that macro each loop iteration gives a small
boost:

v1, but using memcpy()

 mixed | ascii
---+---
   601 |   129

with haszero()

 mixed | ascii
---+---
   583 |   105

remove zero-byte check:

 mixed | ascii
---+---
   588 |93

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
I wrote:
>
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas  wrote:
> One of his earlier demos [1] (in simdutf8check.h) had a version that used
mostly SSE2 with just three intrinsics from SSSE3. That's widely available
by now. He measured that at 0.7 cycles per byte, which is still good
compared to AVX2 0.45 cycles per byte [2].
>
> Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would
assume that if that check (and the corresponding runtime check) passes, we
can assume SSE2. That code has three licenses to choose from -- Apache 2,
Boost, and MIT. Something like that might be straightforward to start from.
I think the only obstacles to worry about are license and getting it to fit
into our codebase. Adding more than zero high-level comments with a good
description of how it works in detail is also a bit of a challenge.

I double checked, and it's actually two SSSE3 intrinsics and one SSE4.1,
but the 4.1 one can be emulated with a few SSE2 intrinsics. But we could
probably fold all three into the SSE4.2 CRC check and have a single symbol
to save on boilerplate.

I hacked that demo [1] into wchar.c (very ugly patch attached), and got the
following:

master

 mixed | ascii
---+---
   757 |   366

Lemire demo:

 mixed | ascii
---+---
   172 |   168

This one lacks an ascii fast path, but the AVX2 version in the same file
has one that could probably be easily adapted. With that, I think this
would be worth adapting to our codebase and license. Thoughts?

The advantage of this demo is that it's not buried in a mountain of modern
C++.

Simdjson can use AVX -- do you happen to know which target it got compiled
to? AVX vectors are 256-bits wide and that requires OS support. The OS's we
care most about were updated 8-12 years ago, but that would still be
something to check, in addition to more configure checks.

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include

--
John Naylor
EDB: http://www.enterprisedb.com


utf-sse42-demo.patch
Description: Binary data


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread Heikki Linnakangas

On 09/02/2021 22:08, John Naylor wrote:
Maybe there's a smarter way to check for zeros in C. Or maybe be more 
careful about cache -- running memchr() on the whole input first might 
not be the best thing to do.


The usual trick is the haszero() macro here: 
https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's 
how memchr() is typically implemented, too.
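
For a 64-bit word that trick looks something like this (sketch only, with
the constants straight from that page and a made-up name):

#include <stdint.h>

/* nonzero iff some byte of v is 0x00 */
#define HAS_ZERO64(v) \
	(((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

So a whole 8-byte chunk can be tested for embedded NUL bytes with a couple
of ALU operations instead of a per-byte loop.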


- Heikki




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas  wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> ---+---
> 447 |46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

That fallback is very similar to my "inline C" case upthread, and they both
actually check 16 bytes at a time (the comment is wrong in the patch you
shared). I can work back and show how the performance changes with each
difference (just MacOS, clang 10 here):

master

 mixed | ascii
---+---
   757 |   366

v1, but using memcpy()

 mixed | ascii
---+---
   601 |   129

remove zero-byte check:

 mixed | ascii
---+---
   588 |93

inline ascii fastpath into pg_utf8_verifystr()

 mixed | ascii
---+---
   595 |71

use 16-byte stride

 mixed | ascii
---+---
   652 |49

With this cpu/compiler, v1 is fastest on the mixed input all else being
equal.

Maybe there's a smarter way to check for zeros in C. Or maybe be more
careful about cache -- running memchr() on the whole input first might not
be the best thing to do.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-08 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas  wrote:
> As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's
> algorithm from the simdjson library [1], see attached patch. I
> microbenchmarked it using the same test I used before [2].

I've been looking at various iterations of Lemire's utf8 code, and trying
it out was next on my list, so thanks for doing that!

> These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1
> 20210110"
>
> unpatched master:
>
> postgres=# \i mbverifystr-speed.sql
> CREATE FUNCTION
>   mixed | ascii
> ---+---
> 728 |   393
> (1 row)
>
> v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:
>
>   mixed | ascii
> ---+---
> 759 |98
> (1 row)

Hmm, the mixed case got worse -- I haven't seen that in any of my tests.

> simdjson-utf8-hack.patch:
>
>   mixed | ascii
> ---+---
>  53 |31
> (1 row)
>
> So clearly that algorithm is fast. Not sure if it has a high startup
> cost, or large code size, or other tradeoffs that we don't want.

The simdjson lib uses everything up through AVX512 depending on what
hardware is available. I seem to remember reading that high start-up cost
is more relevant to floating point than to integer ops, but I could be
wrong. Just the utf8 portion is surely tiny also.

> At
> least it depends on SIMD instructions, so it requires more code for the
> architecture-specific implementations and autoconf logic and all that.

One of his earlier demos [1] (in simdutf8check.h) had a version that used
mostly SSE2 with just three intrinsics from SSSE3. That's widely available
by now. He measured that at 0.7 cycles per byte, which is still good
compared to AVX2 0.45 cycles per byte [2].

Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would
assume that if that check (and the corresponding runtime check) passes, we
can assume SSE2. That code has three licenses to choose from -- Apache 2,
Boost, and MIT. Something like that might be straightforward to start
from. I think the only obstacles to worry about are license and getting it
to fit into our codebase. Adding more than zero high-level comments with a
good description of how it works in detail is also a bit of a challenge.

> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> ---+---
> 447 |46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

Okay, I'll look into that.

> PS. Your patch as it stands isn't safe on systems with strict alignment,
> the string passed to the verify function isn't guaranteed to be 8 bytes
> aligned. Use memcpy to fetch the next 8-byte chunk to fix.

Will do.

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include
[2]
https://lemire.me/blog/2018/10/19/validating-utf-8-bytes-using-only-0-45-cycles-per-byte-avx-edition/

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-08 Thread Heikki Linnakangas

On 07/02/2021 22:24, John Naylor wrote:
Here is a more polished version of the function pointer approach, now 
adapted to all multibyte encodings. Using the not-yet-committed tests 
from [1], I found a thinko bug that resulted in the test for nul bytes 
to not only be wrong, but probably also elided by the compiler. Doing it 
correctly is noticeably slower on pure ascii, but still several times 
faster than before, so the conclusions haven't changed any. I'll run 
full measurements later this week, but I'll share the patch now for review.


As a quick test, I hacked up pg_utf8_verifystr() to use Lemire's 
algorithm from the simdjson library [1], see attached patch. I 
microbenchmarked it using the same test I used before [2].


These results are with "gcc -O2" using "gcc (Debian 10.2.1-6) 10.2.1 
20210110"


unpatched master:

postgres=# \i mbverifystr-speed.sql
CREATE FUNCTION
 mixed | ascii
---+---
   728 |   393
(1 row)

v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch:

 mixed | ascii
---+---
   759 |98
(1 row)

simdjson-utf8-hack.patch:

 mixed | ascii
---+---
53 |31
(1 row)

So clearly that algorithm is fast. Not sure if it has a high startup 
cost, or large code size, or other tradeoffs that we don't want. At 
least it depends on SIMD instructions, so it requires more code for the 
architecture-specific implementations and autoconf logic and all that. 
Nevertheless I think it deserves a closer look, I'm a bit reluctant to 
put in half-way measures, when there's a clearly superior algorithm out 
there.


I also tested the fallback implementation from the simdjson library 
(included in the patch, if you uncomment it in simdjson-glue.c):


 mixed | ascii
---+---
   447 |46
(1 row)

I think we should at least try to adopt that. At a high level, it looks 
pretty similar your patch: you load the data 8 bytes at a time, check if 
there are all ASCII. If there are any non-ASCII chars, you check the 
bytes one by one, otherwise you load the next 8 bytes. Your patch should 
be able to achieve the same performance, if done right. I don't think 
the simdjson code forbids \0 bytes, so that will add a few cycles, but 
still.
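
In rough pseudo-C, the loop shape I mean is something like this (sketch
only, with a stand-in per-character verifier -- not the simdjson code, and
it leaves out the \0 check mentioned above):

#include <stdint.h>
#include <string.h>

static int
verify_one_char(const unsigned char *s, int remaining)
{
	/*
	 * Stand-in only: a real implementation would validate the whole
	 * multibyte sequence, e.g. with pg_utf8_verifychar().
	 */
	return remaining > 0 ? 1 : 0;
}

/* returns the number of valid bytes at the start of s */
static int
verify_sketch(const unsigned char *s, int len)
{
	const unsigned char *start = s;

	while (len > 0)
	{
		uint64_t	chunk;

		if (len >= (int) sizeof(chunk))
		{
			memcpy(&chunk, s, sizeof(chunk));
			if ((chunk & UINT64_C(0x8080808080808080)) == 0)
			{
				/* whole chunk is ASCII, skip ahead */
				s += sizeof(chunk);
				len -= sizeof(chunk);
				continue;
			}
		}

		/* non-ASCII byte in range, or a short tail: check one character */
		{
			int			l = verify_one_char(s, len);

			if (l == 0)
				break;
			s += l;
			len -= l;
		}
	}
	return s - start;
}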


[1] https://github.com/simdjson/simdjson
[2] 
https://www.postgresql.org/message-id/06d45421-61b8-86dd-e765-f1ce527a5...@iki.fi


- Heikki

PS. Your patch as it stands isn't safe on systems with strict alignment, 
the string passed to the verify function isn't guaranteed to be 8 bytes 
aligned. Use memcpy to fetch the next 8-byte chunk to fix.


diff --git a/src/backend/Makefile b/src/backend/Makefile
index 9672e2cb43a..aa79526b0cf 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -54,6 +54,9 @@ ifeq ($(with_systemd),yes)
 LIBS += -lsystemd
 endif
 
+LIBS += -lsimdjson
+
+
 ##
 
 all: submake-libpgport submake-catalog-headers submake-utils-headers postgres $(POSTGRES_IMP)
@@ -63,7 +66,7 @@ ifneq ($(PORTNAME), win32)
 ifneq ($(PORTNAME), aix)
 
 postgres: $(OBJS)
-	$(CC) $(CFLAGS) $(call expand_subsys,$^) $(LDFLAGS) $(LDFLAGS_EX) $(export_dynamic) $(LIBS) -o $@
+	$(CXX) $(CFLAGS) $(call expand_subsys,$^) $(LDFLAGS) $(LDFLAGS_EX) $(export_dynamic) $(LIBS) -o $@
 
 endif
 endif
diff --git a/src/common/Makefile b/src/common/Makefile
index 5422579a6a2..f3e4d985062 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -92,6 +92,9 @@ OBJS_COMMON += \
 	sha2.o
 endif
 
+simdjson-glue_srv.o:
+	$(CXX) $(CXXFLAGS) -c simdjson-glue.c -o simdjson-glue_srv.o
+
 # A few files are currently only built for frontend, not server
 # (Mkvcbuild.pm has a copy of this list, too).  logging.c is excluded
 # from OBJS_FRONTEND_SHLIB (shared library) as a matter of policy,
@@ -110,6 +113,9 @@ OBJS_FRONTEND = \
 OBJS_SHLIB = $(OBJS_FRONTEND_SHLIB:%.o=%_shlib.o)
 OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
 
+# Only use the simdjson version in the server
+OBJS_SRV += simdjson-glue_srv.o
+
 # where to find gen_keywordlist.pl and subsidiary files
 TOOLSDIR = $(top_srcdir)/src/tools
 GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
diff --git a/src/common/simdjson-glue.c b/src/common/simdjson-glue.c
new file mode 100644
index 000..9c300bde022
--- /dev/null
+++ b/src/common/simdjson-glue.c
@@ -0,0 +1,90 @@
+#include 
+
+#include 
+#include 
+#include 
+
+// Heikki: This is copy-pasted from simdjson library
+// src/fallback/dom_parser_implementation.cpp
+//
+// Note: Apache licensed!
+
+// credit: based on code from Google Fuchsia (Apache Licensed)
+static bool fallback_validate_utf8(const char *buf, size_t len)
+{
  const uint8_t *data = reinterpret_cast<const uint8_t *>(buf);
+  uint64_t pos = 0;
+  uint32_t code_point = 0;
+  while (pos < len) {
+// check of the next 8 bytes are ascii.
+uint64_t next_pos = pos + 16;
+if (next_pos <= len) { // if it is safe to read 8 more bytes, check that they are ascii
+  uint64_t v1;
  memcpy(&v1, 

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-07 Thread John Naylor
Here is a more polished version of the function pointer approach, now
adapted to all multibyte encodings. Using the not-yet-committed tests from
[1], I found a thinko bug that caused the test for nul bytes to be not
only wrong, but probably also elided by the compiler. Doing it correctly
is noticeably slower on pure ascii, but still several times faster than
before, so the conclusions haven't changed any. I'll run full measurements
later this week, but I'll share the patch now for review.

[1]
https://www.postgresql.org/message-id/11d39e63-b80a-5f8d-8043-fff04201f...@iki.fi

-- 
John Naylor
EDB: http://www.enterprisedb.com


v1-0001-Add-an-ASCII-fast-path-to-multibyte-encoding-veri.patch
Description: Binary data


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-04 Thread John Naylor
On Mon, Feb 1, 2021 at 2:01 PM Heikki Linnakangas  wrote:
>
> On 01/02/2021 19:32, John Naylor wrote:
> > It makes sense to start with the ascii subset of UTF-8 for a couple
> > reasons. First, ascii is very widespread in database content,
> > particularly in bulk loads. Second, ascii can be validated using the
simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and
> > I'm guessing we can detect that at compile time and not mess with
> > runtime checks. The examples above using SSE for the general case are
> > much more complicated and involve SSE 4.2 or AVX.
>
> I wonder how using SSE compares with dealing with 64 or 32-bit words at
> a time, using regular instructions? That would be more portable.

I gave that a shot, and it's actually pretty good. According to this paper
[1], 16 bytes was best and gives a good apples-to-apples comparison to SSE
registers, so I tried both 16 and 8 bytes.

> All supported encodings are ASCII subsets. Might be best to put the
> ASCII-check into a static inline function and use it in all the verify
> functions. I presume it's only a few instructions, and these functions
> can be pretty performance sensitive.

I tried both the static inline function and also putting the whole
optimized utf-8 loop in a separate function to which the caller passes a
pointer to the appropriate pg_*_verifychar().

In the table below, "inline" refers to coding directly inside
pg_utf8_verifystr(). Both C and SSE are in the same patch, with an #ifdef.
I didn't bother splitting them out because for other encodings, we want one
of the other approaches above. For those, "C retail" refers to a static
inline function to code the contents of the inner loop, if I understood
your suggestion correctly. This needs more boilerplate in each function, so
I don't prefer this. "C func pointer" refers to the pointer approach I just
mentioned. That is the cleanest looking way to generalize it, so I only
tested that version with different strides -- 8- and 16-bytes

This is the same test I used earlier, which is the test in [2] but adding
an almost-pure multibyte Chinese text of about the same size.

x86-64 Linux gcc 8.4.0:

  build   | chinese | mixed | ascii
--+-+---+---
 master   |1480 |   848 |   428
 inline SSE   |1617 |   634 |63
 inline C |1481 |   843 |50
 C retail |1493 |   838 |49
 C func pointer   |1467 |   851 |49
 C func pointer 8 |1518 |   757 |56

x86-64 MacOS clang 10.0.0:

  build   | chinese | mixed | ascii
--+-+---+---
 master   |1086 |   760 |   374
 inline SSE   |1081 |   529 |70
 inline C |1093 |   649 |49
 C retail |1132 |   695 |   152
 C func pointer   |1085 |   609 |59
 C func pointer 8 |1099 |   571 |71

PowerPC-LE Linux gcc 4.8.5:

  build   | chinese | mixed | ascii
--+-+---+---
 master   |2961 |  1525 |   871
 inline SSE   |   (n/a) | (n/a) | (n/a)
 inline C |2911 |  1329 |80
 C retail |2838 |  1311 |   102
 C func pointer   |2828 |  1314 |80
 C func pointer 8 |3143 |  1249 |   133

Looking at the results, the main advantage of SSE here is it's more robust
for mixed inputs. If a 16-byte chunk is not ascii-only but contains a block
of ascii at the front, we can skip those with a single CPU instruction, but
in C, we have to verify the whole chunk using the slow path.
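
To make that more concrete, one cheap way to find the length of the leading
ASCII run in a 16-byte chunk is a movemask plus a count-trailing-zeros
(illustrative only, not the code in the patch; assumes a compiler with
__builtin_ctz):

#include <emmintrin.h>

static inline int
leading_ascii_bytes(__m128i chunk)
{
	/* one mask bit per byte, set where that byte's high bit is set */
	unsigned int highbits = (unsigned int) _mm_movemask_epi8(chunk);

	if (highbits == 0)
		return 16;				/* whole chunk is ASCII */

	return __builtin_ctz(highbits); /* bytes before the first non-ASCII one */
}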

The "C func pointer approach" seems to win out over the "C retail" approach
(static inline function).

Using an 8-byte stride is slightly better for mixed inputs on all platforms
tested, but regresses on pure ascii and also seems to regress on pure
multibyte. The difference in the multibyte case is small enough that it
could be random, but it happens on two platforms, so I'd say it's real. On
the other hand, pure multibyte is not as common as mixed text.

Overall, I think the function pointer approach with an 8-byte stride is the
best balance. If that's agreeable, next I plan to test with short inputs,
because I think we'll want a guard if-statement to only loop through the
fast path if the string is long enough to justify that.

> > I also gave a shot at doing full UTF-8 recognition using a DFA, but so
> > far that has made performance worse. If I ever have more success with
> > that, I'll add that in the mix.
>
> That's disappointing. Perhaps the SIMD algorithms have higher startup
> costs, so that you need longer inputs to benefit? In that case, it might
> make sense to check the length of the input and only use the SIMD
> algorithm if the input is long enough.

I changed topics a bit quickly, but here I'm talking about using a
table-driven state machine to verify the multibyte case. It's possible I
did something wrong, since my model implementation decodes, and having to
keep track of how 

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-01 Thread Heikki Linnakangas

On 01/02/2021 19:32, John Naylor wrote:
It makes sense to start with the ascii subset of UTF-8 for a couple 
reasons. First, ascii is very widespread in database content, 
particularly in bulk loads. Second, ascii can be validated using the 
simple SSE2 intrinsics that come with (I believe) any x86-64 chip, and 
I'm guessing we can detect that at compile time and not mess with 
runtime checks. The examples above using SSE for the general case are 
much more complicated and involve SSE 4.2 or AVX.


I wonder how using SSE compares with dealing with 64 or 32-bit words at 
a time, using regular instructions? That would be more portable.


Here are some numbers on my laptop (MacOS/clang 10 -- if the concept is 
okay, I'll do Linux/gcc and add more inputs). The test is the same as 
Heikki shared in [3], but I added a case with >95% Chinese characters 
just to show how that compares to the mixed ascii/multibyte case.


master:

  chinese | mixed | ascii
-+---+---
     1081 |   761 |   366

patch:

  chinese | mixed | ascii
-+---+---
     1103 |   498 |    51

The speedup in the pure ascii case is nice.


Yep.

In the attached POC, I just have a pro forma portability stub, and left 
full portability detection for later. The fast path is inlined inside 
pg_utf8_verifystr(). I imagine the ascii fast path could be abstracted 
into a separate function to which is passed a function pointer for full 
encoding validation. That would allow other encodings with strict ascii 
subsets to use this as well, but coding that abstraction might be a 
little messy, and b80e10638e3 already gives a performance boost over PG13.


All supported encodings are ASCII subsets. Might be best to put the 
ASCII-check into a static inline function and use it in all the verify 
functions. I presume it's only a few instructions, and these functions 
can be pretty performance sensitive.


I also gave a shot at doing full UTF-8 recognition using a DFA, but so 
far that has made performance worse. If I ever have more success with 
that, I'll add that in the mix.


That's disappointing. Perhaps the SIMD algorithms have higher startup 
costs, so that you need longer inputs to benefit? In that case, it might 
make sense to check the length of the input and only use the SIMD 
algorithm if the input is long enough.


- Heikki