Re: speed up verifying UTF-8

2021-12-20 Thread John Naylor
On Fri, Dec 17, 2021 at 9:29 AM John Naylor wrote: > > I plan to push v25 early next week, unless there are further comments. Pushed, thanks everyone! -- John Naylor EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

2021-12-17 Thread John Naylor
I plan to push v25 early next week, unless there are further comments. -- John Naylor EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

2021-12-13 Thread John Naylor
On Fri, Dec 10, 2021 at 2:33 PM Heikki Linnakangas wrote: > I had another look at this now. Looks good, just a few minor comments below: Thanks for reviewing! I've attached v25 to address your points. > This function assumes that the input len is a multiple of 8. There's an > assertion for

RE: [EXTERNAL] Re: speed up verifying UTF-8

2021-12-10 Thread Godfrin, Philippe E
>-Original Message- >From: Heikki Linnakangas >Sent: Friday, December 10, 2021 12:34 PM >To: John Naylor ; Vladimir Sitnikov > >Cc: pgsql-hackers ; Amit Khandekar >; Thomas Munro ; Greg Stark > >Subject: [EXTERNAL] Re: speed up verifying UTF-8 > >On 20

Re: speed up verifying UTF-8

2021-12-10 Thread Heikki Linnakangas
On 20/10/2021 00:42, John Naylor wrote: I've decided I'm not quite comfortable with the additional complexity in the build system introduced by the SIMD portion of the previous patches. It would make more sense if the pure C portion were unchanged, but with the shift-based DFA plus the bitwise

Re: speed up verifying UTF-8

2021-12-08 Thread John Naylor
It occurred to me that the DFA + ascii quick check approach could also be adapted to speed up some cases where we currently walk a string counting characters, like this snippet in text_position_get_match_pos(): /* Convert the byte position to char position. */ while (state->refpoint <

Re: speed up verifying UTF-8

2021-10-19 Thread John Naylor
I've decided I'm not quite comfortable with the additional complexity in the build system introduced by the SIMD portion of the previous patches. It would make more sense if the pure C portion were unchanged, but with the shift-based DFA plus the bitwise ASCII check, we have a portable

Re: speed up verifying UTF-8

2021-08-26 Thread Vladimir Sitnikov
>Attached is v23 incorporating the 32-bit transition table, with the necessary comment adjustments 32bit table is nice. Would you please replace https://github.com/BobSteagall/utf_utils/blob/master/src/utf_utils.cpp URL with

Re: speed up verifying UTF-8

2021-08-26 Thread John Naylor
I wrote: > Naively, the shift-based DFA requires 64-bit integers to encode the transitions, but I recently came across an idea from Dougall Johnson of using the Z3 SMT solver to pack the transitions into 32-bit integers [1]. That halves the size of the transition table for free. I adapted that

Re: speed up verifying UTF-8

2021-08-24 Thread John Naylor
Naively, the shift-based DFA requires 64-bit integers to encode the transitions, but I recently came across an idea from Dougall Johnson of using the Z3 SMT solver to pack the transitions into 32-bit integers [1]. That halves the size of the transition table for free. I adapted that effort to the

Re: speed up verifying UTF-8

2021-08-04 Thread John Naylor
I wrote: > If we have only 16 bytes in the input, it still seems to be faster to use SSE, even though it's called through a function pointer on x86. I didn't test the DFA path, but I don't think the conclusion would be different. I'll include the 16 threshold next time I need to update the patch.

Re: speed up verifying UTF-8

2021-07-29 Thread John Naylor
On Mon, Jul 26, 2021 at 8:56 AM John Naylor wrote: > > > > > Does that (and "len >= 32" condition) mean the patch does not improve validation of the shorter strings (the ones less than 32 bytes)? > > Right. Also, the 32 byte threshold was just a temporary need for testing 32-byte stride --

Re: speed up verifying UTF-8

2021-07-28 Thread John Naylor
I wrote: > On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > > > > >+ utf8_advance(s, state, len); > > >+ > > >+ /* > > >+ * If we saw an error during the loop, let the caller handle it. We treat > > >+ * all other states as success. > > >+ */ > > >+ if

Re: speed up verifying UTF-8

2021-07-26 Thread John Naylor
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > > Just wondering, do you have the code in a GitHub/Gitlab branch? Sorry, I didn't see this earlier. No, I don't. -- John Naylor EDB: http://www.enterprisedb.com

Re: speed up verifying UTF-8

2021-07-26 Thread John Naylor
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > > Just wondering, do you have the code in a GitHub/Gitlab branch? > > >+ utf8_advance(s, state, len); > >+ > >+ /* > >+ * If we saw an error during the loop, let the caller handle it. We treat > >+ * all

Re: speed up verifying UTF-8

2021-07-26 Thread Vladimir Sitnikov
Just wondering, do you have the code in a GitHub/Gitlab branch? >+ utf8_advance(s, state, len); >+ >+ /* >+ * If we saw an error during the loop, let the caller handle it. We treat >+ * all other states as success. >+ */ >+ if (state == ERR) >+ return 0; Did you mean state = utf8_advance(s,

Re: speed up verifying UTF-8

2021-07-26 Thread John Naylor
Attached is v20, which has a number of improvements: 1. Cleaned up and explained DFA coding. 2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to produce an optimized loop, using branch-free accumulators. That way, it doesn't need to be rewritten for different input lengths.

Re: speed up verifying UTF-8

2021-07-21 Thread John Naylor
On Wed, Jul 21, 2021 at 12:13 PM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > It looks like the same utf8_advance function is good for both fast-path and for the slow path. > Then pg_utf8_verifychar could be removed altogether along with the corresponding IS_*_BYTE_LEAD macros.

Re: speed up verifying UTF-8

2021-07-21 Thread Vladimir Sitnikov
>I'm pretty confident this improvement is architecture-independent. Thanks for testing it with different architectures. It looks like the same utf8_advance function is good for both fast-path and for the slow path. Then pg_utf8_verifychar could be removed altogether along with the corresponding

Re: speed up verifying UTF-8

2021-07-20 Thread John Naylor
> On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > > An alternative idea: should we optimize for validation of **valid** inputs rather than optimizing the worst case? > > In other words, what if the implementation processes all characters always and uses a

Re: speed up verifying UTF-8

2021-07-19 Thread John Naylor
On Mon, Jul 19, 2021 at 9:43 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > It looks like it is important to have shrx for x86 which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ). > Just in case: I know x86 wound

Re: speed up verifying UTF-8

2021-07-19 Thread Vladimir Sitnikov
Thank you, It looks like it is important to have shrx for x86 which appears only when -march=x86-64-v3 is used (see https://github.com/golang/go/issues/47120#issuecomment-877629712 ). Just in case: I know x86 wound not use fallback implementation, however, the sole purpose of shift-based DFA is

Re: speed up verifying UTF-8

2021-07-18 Thread Amit Khandekar
On Sat, 17 Jul 2021 at 04:48, John Naylor wrote: > v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's > chunk idea for multibyte, and it's pretty good on x86. On Power8, not so > much. 0003 and 0004 are shot-in-the-dark guesses to improve it on Power8, > with some

Re: speed up verifying UTF-8

2021-07-18 Thread John Naylor
I wrote: > On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > > > > Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ? > > I did consider some kind of DFA a while back and

Re: speed up verifying UTF-8

2021-07-16 Thread John Naylor
Forgot the attachments... -- John Naylor EDB: http://www.enterprisedb.com v17-0001-Rewrite-pg_utf8_verifystr-for-speed.patch Description: Binary data v17-0004-Second-attempt-at-addressing-performance-regress.patch Description: Binary data

Re: speed up verifying UTF-8

2021-07-16 Thread John Naylor
My v16 experimental patches were a bit messy, so I've organized an experimental series that applies cumulatively, to try to trace the effects of various things. v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's chunk idea for multibyte, and it's pretty good on x86. On

Re: speed up verifying UTF-8

2021-07-16 Thread John Naylor
On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov < sitnikov.vladi...@gmail.com> wrote: > > Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ? I did consider some kind of DFA a while back and it was too slow. --

Re: speed up verifying UTF-8

2021-07-15 Thread Vladimir Sitnikov
Have you considered shift-based DFA for a portable implementation https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ? Vladimir >

Re: speed up verifying UTF-8

2021-07-15 Thread John Naylor
I wrote: > To simplify the constants, I do shift down to uint32, and I didn't bother working around that. v16alpha regressed on worst-case input, so for v16beta I went back to earlier coding for the one-byte ascii check. That helped, but it's still slower than v14. It occurred to me that I could

Re: speed up verifying UTF-8

2021-07-15 Thread John Naylor
On Thu, Jul 15, 2021 at 1:10 AM Amit Khandekar wrote: > - check_ascii() seems to be used only for 64-bit chunks. So why not > remove the len argument and the len <= sizeof(int64) checks inside the > function. We can rename it to check_ascii64() for clarity. Thanks for taking a look! Well yes,

Re: speed up verifying UTF-8

2021-07-14 Thread Amit Khandekar
On Tue, 13 Jul 2021 at 01:15, John Naylor wrote: > > It seems like it would be easy to have pg_utf8_verify_one in my proposed > > pg_utf8.h header and replace the body of pg_utf8_verifychar with it. > > 0001: I went ahead and tried this for v15, and also attempted some clean-up: > > - Rename

Re: speed up verifying UTF-8

2021-07-12 Thread John Naylor
I wrote: > I don't think the new structuring will pose any challenges for rebasing 0002, either. This might need some experimentation, though: > > + * Subroutine of pg_utf8_verifystr() to check on char. Returns the length of the > + * character at *s in bytes, or 0 on invalid input or premature

Re: speed up verifying UTF-8

2021-06-30 Thread John Naylor
On Wed, Jun 30, 2021 at 7:18 AM Heikki Linnakangas wrote: > Hmm, there's one more simple trick we can do: We can have a separate > fast-path version of the loop when there are at least 8 bytes of input > left, skipping all the length checks. With that: Good idea, and the numbers look good on

Re: speed up verifying UTF-8

2021-06-30 Thread Heikki Linnakangas
On 29/06/2021 14:20, John Naylor wrote: I still wasn't quite happy with the churn in the regression tests, so for v13 I gave up on using both the existing utf8 table and my new one for the "padded input" tests, and instead just copied the NUL byte test into the new table. Also added a primary

Re: speed up verifying UTF-8

2021-06-29 Thread John Naylor
I still wasn't quite happy with the churn in the regression tests, so for v13 I gave up on using both the existing utf8 table and my new one for the "padded input" tests, and instead just copied the NUL byte test into the new table. Also added a primary key to make sure the padded test won't give

Re: speed up verifying UTF-8

2021-06-10 Thread John Naylor
I wrote: > Also, if SSE is accepted into the tree, then the C fallback is only important on platforms like PowerPC64 and Arm64, so we can make the tradeoff by testing those more carefully. I'll test on PowerPC soon. I got around to testing on POWER8 / Linux / gcc 4.8.5 and found a regression in

Re: speed up verifying UTF-8

2021-06-10 Thread John Naylor
On Wed, Jun 9, 2021 at 7:02 AM Heikki Linnakangas wrote: > What is the worst case scenario for this algorithm? Something where the > new fast ASCII check never helps, but is as fast as possible with the > old code. For that, I added a repeating pattern of '123456789012345ä' to > the test set

Re: speed up verifying UTF-8

2021-06-09 Thread Heikki Linnakangas
On 07/06/2021 15:39, John Naylor wrote: On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas > wrote: > > On 03/06/2021 21:58, John Naylor wrote: > > The microbenchmark is the same one you attached to [1], which I extended > > with a 95% multibyte case. > > Could you

Re: speed up verifying UTF-8

2021-06-07 Thread John Naylor
On Mon, Jun 7, 2021 at 8:24 AM Heikki Linnakangas wrote: > > On 03/06/2021 21:58, John Naylor wrote: > > The microbenchmark is the same one you attached to [1], which I extended > > with a 95% multibyte case. > > Could you share the exact test you're using? I'd like to test this on my > old

Re: speed up verifying UTF-8

2021-06-07 Thread Heikki Linnakangas
On 03/06/2021 21:58, John Naylor wrote: > What test set have you been using for performance testing this? I'd like The microbenchmark is the same one you attached to [1], which I extended with a 95% multibyte case. Could you share the exact test you're using? I'd like to test this on my

Re: speed up verifying UTF-8

2021-06-06 Thread John Naylor
On Thu, Jun 3, 2021 at 3:22 PM Heikki Linnakangas wrote: > > On 03/06/2021 22:16, Heikki Linnakangas wrote: > > On 03/06/2021 22:10, John Naylor wrote: > >> On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas >> > wrote: > >> > x1 = half1 +

Re: speed up verifying UTF-8

2021-06-03 Thread Heikki Linnakangas
On 03/06/2021 22:16, Heikki Linnakangas wrote: On 03/06/2021 22:10, John Naylor wrote: On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas mailto:hlinn...@iki.fi>> wrote: >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f); >                 x2 = half2 +

Re: speed up verifying UTF-8

2021-06-03 Thread Heikki Linnakangas
On 03/06/2021 22:10, John Naylor wrote: On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas > wrote: >                 x1 = half1 + UINT64CONST(0x7f7f7f7f7f7f7f7f); >                 x2 = half2 + UINT64CONST(0x7f7f7f7f7f7f7f7f); > >                 /* then check that

Re: speed up verifying UTF-8

2021-06-03 Thread John Naylor
On Thu, Jun 3, 2021 at 3:08 PM Heikki Linnakangas wrote: > > On 03/06/2021 17:33, Greg Stark wrote: > >> 3. It's probably cheaper perform the HAS_ZERO check just once on (half1 > > | half2). We have to compute (half1 | half2) anyway. > > > > Wouldn't you have to check (half1 & half2) ? > > Ah,

Re: speed up verifying UTF-8

2021-06-03 Thread Heikki Linnakangas
On 03/06/2021 17:33, Greg Stark wrote: 3. It's probably cheaper perform the HAS_ZERO check just once on (half1 | half2). We have to compute (half1 | half2) anyway. Wouldn't you have to check (half1 & half2) ? Ah, you're right of course. But & is not quite right either, it will give false

Re: speed up verifying UTF-8

2021-06-03 Thread John Naylor
On Thu, Jun 3, 2021 at 9:16 AM Heikki Linnakangas wrote: > Some ideas: > > 1. Better to check if any high bits are set first. We care more about > the speed of that than of detecting zero bytes, because input with high > bits is valid but zeros are an error. > > 2. Since we check that there are

Re: speed up verifying UTF-8

2021-06-03 Thread John Naylor
I wrote: > On Thu, Jun 3, 2021 at 10:42 AM Greg Stark wrote: > > > > If > > we're processing much more than 128 bits and happy to detect NUL > > errors only at the end after wasting some work then you could hoist > > that has_zero check entirely out of the loop (removing the branch > > though

Re: speed up verifying UTF-8

2021-06-03 Thread John Naylor
On Thu, Jun 3, 2021 at 10:42 AM Greg Stark wrote: > > I haven't looked at the surrounding code. Are we processing all the > COPY data in one long stream or processing each field individually? It happens on 64kB chunks. > If > we're processing much more than 128 bits and happy to detect NUL >

Re: speed up verifying UTF-8

2021-06-03 Thread Greg Stark
I haven't looked at the surrounding code. Are we processing all the COPY data in one long stream or processing each field individually? If we're processing much more than 128 bits and happy to detect NUL errors only at the end after wasting some work then you could hoist that has_zero check

Re: speed up verifying UTF-8

2021-06-03 Thread Greg Stark
> 3. It's probably cheaper perform the HAS_ZERO check just once on (half1 | half2). We have to compute (half1 | half2) anyway. Wouldn't you have to check (half1 & half2) ?

Re: speed up verifying UTF-8

2021-06-03 Thread Heikki Linnakangas
On 02/06/2021 19:26, John Naylor wrote: For v10, I've split the patch up into two parts. 0001 uses pure C everywhere. This is much smaller and easier to review, and gets us the most bang for the buck. One concern Heikki raised upthread is that platforms with poor unaligned-memory access will

speed up verifying UTF-8

2021-06-02 Thread John Naylor
For v10, I've split the patch up into two parts. 0001 uses pure C everywhere. This is much smaller and easier to review, and gets us the most bang for the buck. One concern Heikki raised upthread is that platforms with poor unaligned-memory access will see a regression. We could easily add an