Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Martin Frb via lazarus
On 30/12/2021 14:43, Marco van de Voort via lazarus wrote: Compile with -O4 -Cpcoreavx2, the others (non-asm) will become faster; my guess is that "add" will be about double that of asm. Core i7 8700K; 3.3.1 from Dec 10th; 3.2.3 from Dec 9th. With FPC 3.3.1: - fst is worse? - add gets better -O4

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Marco van de Voort via lazarus
On 30-12-2021 14:17, John Landmesser via lazarus wrote: Perhaps useful test information from my PC: 77 Compile with -O4 -Cpcoreavx2, the others (non-asm) will become faster; my guess is that "add" will be about double that of asm. Also, on Windows, use "high performance" as the power scheme. On non-Windows

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread John Landmesser via lazarus
Perhaps useful test information from my PC: ** [john1@manjaro sdb2]$ ./utf8lentest 234526968 fst:128406168 pop:128406168 add:128406168 asm:128406168 29315871 fst 1365 fst 1367 fst 1366 fst 1366 pop 9990 pop 9990 pop 9997 pop 9981 add 1386 add 1382 add

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Marco van de Voort via lazarus
On 30-12-2021 10:15, Florian Klämpfl via lazarus wrote: Linux uses different calling conventions, please check with the patch below. Linux is quite generous with the volatile registers, so luckily it matches quite closely. I first tried the approach of your patch, but [s] has problems on

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-30 Thread Florian Klämpfl via lazarus
On 30.12.21 at 08:23, Alexey Tor. via lazarus wrote: New unit test, with Martin's integrated. If I play with godbolt, Ryzen Zen 3 (Ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, so I would like to see some benchmarks from various processors. Also from very old ones (P4 and

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Alexey Tor. via lazarus
New unit test, with Martin's integrated. If I play with godbolt, Ryzen Zen 3 (Ryzen 5x00X) is nearly twice as fast in cycles as my Ivy Bridge, so I would like to see some benchmarks from various processors. Also from very old ones (P4 and Clawhammers) to test instruction sets. Project

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus
On 30-12-2021 01:29, Marco van de Voort via lazarus wrote: (P4 and Clawhammers) to test instruction sets. 64-bit-capable Pentium 4s, of course, since this is a 64-bit-only test. Claw/Sledgehammer is the first iteration of Athlon 64, and of the x86_64 architecture as a whole, and misses

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus
On 29-12-2021 00:00, Bart via lazarus wrote: On Tue, Dec 28, 2021 at 11:35 PM Martin Frb via lazarus wrote: I have a Core i7-8600. The diff between the old code and popcnt is less significant. old: 715, pop: 695. But there is a 3rd way that is faster. add: 610. Not surprising that you should

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus
On 29-12-2021 16:30, Martin Frb via lazarus wrote: Could you post the full source if you haven't already? For a bit of benchmarking. I just wrote it off the top of my head, and I assumed 5 instructions for 16 bytes would win every time, but haven't verified anything yet. I had it attached on

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Martin Frb via lazarus
On 29/12/2021 13:42, Marco van de Voort via lazarus wrote: On 29-12-2021 10:16, Martin Frb via lazarus wrote: // Martin's routine that should be replaced by some punpckl magic, but it is too late now. Why too late? See the date/time stamp: 02:10 AM. I don't know how it is with you Lazarus

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Florian Klämpfl via lazarus
On 29.12.2021 at 13:42, Marco van de Voort via lazarus wrote: P.S. Is there a workaround for git worktree to work on the same branch? E.g. trunk for 32-bit and trunk for 64-bit? :-) No. You cannot check out the same branch in two worktrees. But you can do the following: create a new branch

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Marco van de Voort via lazarus
On 29-12-2021 10:16, Martin Frb via lazarus wrote: // Martin's routine that should be replaced by some punpckl magic, but it is too late now. Why too late? See the date/time stamp: 02:10 AM. I don't know how it is with you Lazarus devels, but we FPC devels need our beauty sleep from time to

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-29 Thread Martin Frb via lazarus
On 29/12/2021 02:10, Marco van de Voort via lazarus wrote: On 28-12-2021 23:35, Martin Frb via lazarus wrote: "nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit). If we regard each of these bytes as an entity of its own, then we can keep adding those "1". I also was
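The byte-lane idea quoted above can be sketched as follows. This is not the LazUTF8 or thread code; it is a minimal Python illustration of the arithmetic only (not the speed), assuming little-endian 8-byte loads and counting UTF-8 continuation bytes (10xxxxxx), so that length = byte count minus continuation count. The names `continuation_mask` and `utf8_length_add` are hypothetical. Each lane of the mask holds 0 or 1, so a plain 64-bit add can accumulate up to 255 words before any lane could overflow; the lanes are then widened to 16 bits and folded with one multiply (the widening step is this sketch's own choice, since eight lanes of up to 255 would overflow a single byte).

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF
ONES = 0x0101_0101_0101_0101      # 0x01 in every byte lane
EIGHTIES = 0x8080_8080_8080_8080  # 0x80 in every byte lane
LOW_BYTES = 0x00FF_00FF_00FF_00FF  # low byte of every 16-bit lane
LANES16 = 0x0001_0001_0001_0001    # 1 in every 16-bit lane


def continuation_mask(word: int) -> int:
    """1 in bit 0 of each byte lane whose byte matches 10xxxxxx, else 0."""
    top = (word & EIGHTIES) >> 7                # bit 7 of each byte moved to bit 0
    not_six = ((~word & MASK64) >> 6) & ONES    # inverted bit 6 of each byte at bit 0
    return top & not_six


def utf8_length_add(data: bytes) -> int:
    """Code points = bytes - continuation bytes, using byte-lane accumulation."""
    n_words = len(data) // 8
    continuation = 0
    i = 0
    while i < n_words:
        acc = 0                                 # eight independent byte-wide counters
        block_end = min(i + 255, n_words)       # a lane can hold at most 255
        while i < block_end:
            word = int.from_bytes(data[8 * i:8 * i + 8], "little")
            acc += continuation_mask(word)      # adds 0 or 1 per lane, never carries
            i += 1
        # Widen to four 16-bit lanes (each <= 510), then one multiply
        # leaves the sum of all lanes in the top 16 bits.
        pairs = (acc & LOW_BYTES) + ((acc >> 8) & LOW_BYTES)
        continuation += ((pairs * LANES16) & MASK64) >> 48
    # Tail bytes that do not fill a whole qword.
    continuation += sum(1 for b in data[8 * n_words:] if b & 0xC0 == 0x80)
    return len(data) - continuation
```

The inner loop does only loads, mask arithmetic and adds; the horizontal sum runs once per block of up to 255 words rather than once per word.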

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Marco van de Voort via lazarus
On 28-12-2021 23:35, Martin Frb via lazarus wrote: "nx" has a single "1" in each of the 8 bytes in a Qword (based on 64bit). If we regard each of these bytes as an entity of its own, then we can keep adding those "1". I also was thinking in that direction, but more about how to optimize

Re: [Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Bart via lazarus
On Tue, Dec 28, 2021 at 11:35 PM Martin Frb via lazarus wrote: > I have a Core i7-8600 > The diff between the old code and popcnt is less significant. > > old: 715 > pop: 695 > > But there is a 3rd way, that is faster. > add: 610 Not surprising that you should come up with a faster solution.
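For comparison with the "add" numbers above, here is a rough sketch of what a "pop"-style loop does: build the same per-byte continuation mask for each qword and count its set bits. It is an illustrative Python version, not the benchmarked Pascal code; the name `utf8_length_popcnt` is hypothetical, and `bin(nx).count("1")` merely stands in for the POPCNT instruction.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF
ONES = 0x0101_0101_0101_0101      # 0x01 in every byte lane
EIGHTIES = 0x8080_8080_8080_8080  # 0x80 in every byte lane


def utf8_length_popcnt(data: bytes) -> int:
    """Code points = bytes - continuation bytes, one popcount per qword."""
    n_words = len(data) // 8
    continuation = 0
    for i in range(n_words):
        word = int.from_bytes(data[8 * i:8 * i + 8], "little")
        # 1 in bit 0 of each byte lane holding a 10xxxxxx byte.
        nx = ((word & EIGHTIES) >> 7) & (((~word & MASK64) >> 6) & ONES)
        continuation += bin(nx).count("1")  # stands in for POPCNT
    # Tail bytes that do not fill a whole qword.
    continuation += sum(1 for b in data[8 * n_words:] if b & 0xC0 == 0x80)
    return len(data) - continuation
```

Here a popcount runs for every qword, whereas the byte-lane "add" variant defers the horizontal count to once per block of words; that difference is one plausible reason for the timings quoted, though the thread itself shows the results vary by CPU and compiler settings.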

[Lazarus] Faster than popcnt [[Re: UTF8LengthFast returning incorrect results on AARCH64 (MacOS)]]

2021-12-28 Thread Martin Frb via lazarus
On 28/12/2021 15:50, Bart via lazarus wrote: On Tue, Dec 28, 2021 at 3:39 PM Marco van de Voort via lazarus wrote: On what machine did you test? The settings are for the generated code, but the actual processor determines the effective speed. I have a 7th-generation Intel i5 on my Win10-64