Re: strlcpy version speed tests?
On 1/7/20 10:05 pm, Luke Small wrote:
> Are you clinging to traditions for some purpose?

Are you posting random pieces of code and asking for critique on them
for no apparent reason for some purpose?

To be clear, this was the sum total of your first message in this
thread (excluding attachment for brevity):

> I made a couple different versions if anybody is interested!
> -Luke

Why? Why strlcpy? Why not strcpy? Or memcpy? Why not the whole libc?

Zero context. The email headers and the C source code attachment are 99%
of the whole email. None of those headers start with 'References:' or
'In-Reply-To:'; it was a completely detached email with no link to any
existing discussion, either declared explicitly or implied by its content.

Your single line message seemed like it was asking: "Am I allowed to
bench-test this?" As if we have the power to stop you. Go ahead,
bench-test away!

As to why the stock OpenBSD implementation is written a particular way?
Well, likely a big part of it is wanting the code to behave the same way
in multiple scenarios, e.g. gcc vs clang, AMD64 vs ARM64 vs i386 vs
mips64 vs sparc vs … you get the picture.

Assembly is the "fastest" option, but requires one "implementation" for
each processor architecture, and receives no benefit from improvements
in optimising compilers.

C means it's written *once* and ideally will perform identically for all
systems, whilst also being easier to understand and maintain. If a
problem is found on AMD64, for example, the fix is committed once, and
the remaining work is merely testing that fix on the other architectures
to ensure they don't break. Compare that with fixing it about 6 or 7
times, each time figuring out how to express the same "fix" in _that_
processor's assembly dialect.

I think it naïve to assume that an implementation written to run faster
on one processor architecture and compiled with one compiler will
universally run faster on all other processor+compiler combinations.

Anyway, I've spent more words on this than I care to. So if you don't
mind, I'll be instructing my email client to ignore this thread from
here on in.

Regards,
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
Re: strlcpy version speed tests?
On Sat, Jul 04, 2020 at 09:07:35AM -0400, Brian Brombacher wrote:

> >> On Jul 1, 2020, at 1:14 PM, gwes wrote:
> >>
> >> On 7/1/20 8:05 AM, Luke Small wrote:
> >> I spoke to my favorite university computer science professor who said
> >> ++n is faster than n++ because the function needs to store the initial
> >> value, increment, then return the stored value in the former case,
> >> while the latter merely increments, and returns the value. Apparently,
> >> he is still correct on modern hardware.
> > For decades the ++ and *p could be out of order, in different
> > execution units, writes speculatively queued, assigned to aliased registers,
> > etc, etc, etc.
> >
> > Geoff Steckel
>
> Hey Luke,
>
> I love the passion but try to focus your attention on the fact that there are
> multiple architectures supported and compiler optimizations are key here. Go
> with Marc's approach using arch/ asm. Implementations can be made over time
> for the various arches, if such an approach is desirable by the project. You
> can pull a well-optimized version based on your code, for your arch, and then
> slim it down a bunch.
>
> Cheers,
> Brian
>
> [Not a project developer. Just an observer.]

Another data point for consideration: the pdp11 instruction set had
post-increment and pre-decrement indirect memory reference
instructions. If I'm not mistaken, using pre-increment or
post-decrement on this architecture would impose a penalty.

So your university computer science professor making such sweeping
statements maybe doesn't deserve to be your favorite.

	-Otto
Re: strlcpy version speed tests?
>> On Jul 1, 2020, at 1:14 PM, gwes wrote:
>>
>> On 7/1/20 8:05 AM, Luke Small wrote:
>> I spoke to my favorite university computer science professor who said
>> ++n is faster than n++ because the function needs to store the initial
>> value, increment, then return the stored value in the former case,
>> while the latter merely increments, and returns the value. Apparently,
>> he is still correct on modern hardware.
> For decades the ++ and *p could be out of order, in different
> execution units, writes speculatively queued, assigned to aliased registers,
> etc, etc, etc.
>
> Geoff Steckel

Hey Luke,

I love the passion but try to focus your attention on the fact that there are
multiple architectures supported and compiler optimizations are key here. Go
with Marc's approach using arch/ asm. Implementations can be made over time
for the various arches, if such an approach is desirable by the project. You
can pull a well-optimized version based on your code, for your arch, and then
slim it down a bunch.

Cheers,
Brian

[Not a project developer. Just an observer.]
Re: strlcpy version speed tests?
On 7/1/20 8:05 AM, Luke Small wrote:
> I spoke to my favorite university computer science professor who said
> ++n is faster than n++ because the function needs to store the initial
> value, increment, then return the stored value in the former case,
> while the latter merely increments, and returns the value. Apparently,
> he is still correct on modern hardware.

For decades the ++ and *p could be out of order, in different
execution units, writes speculatively queued, assigned to aliased registers,
etc, etc, etc.

Geoff Steckel
Re: strlcpy version speed tests?
On Wed, Jul 01, 2020 at 07:05:02AM -0500, Luke Small wrote:
> Are you clinging to traditions for some purpose? I gave two different
> versions. strlcpy3 is clearly more easily understood and even slightly
> faster and strlcpy4 which sets up the following workhorse lines which
> through timing the functions is hands down faster on my Xeon chips:
>
> strlcpy4:
> 	while (--nleft != 0)
> 		if ((*++dst = *++src) == '\0')
> 			...
>
> the others:
> 	while (--nleft != 0)
> 		if ((*dst++ = *src++) == '\0')
> 			...
>
> I spoke to my favorite university computer science professor who said
> ++n is faster than n++ because the function needs to store the initial
> value, increment, then return the stored value in the former case,
> while the latter merely increments, and returns the value. Apparently,
> he is still correct on modern hardware.

If you really care about speed, you should probably look into an arch/
asm version instead.
Re: strlcpy version speed tests?
Are you clinging to traditions for some purpose? I gave two different
versions. strlcpy3 is clearly easier to understand and even slightly
faster. strlcpy4 sets up the following workhorse lines and, going by my
timing of the functions, is hands down faster on my Xeon chips:

strlcpy4:
	while (--nleft != 0)
		if ((*++dst = *++src) == '\0')
			...

the others:
	while (--nleft != 0)
		if ((*dst++ = *src++) == '\0')
			...

I spoke to my favorite university computer science professor who said
++n is faster than n++ because the function needs to store the initial
value, increment, then return the stored value in the former case,
while the latter merely increments, and returns the value. Apparently,
he is still correct on modern hardware.

-- 
-Luke
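For reference, the two loop shapes quoted in this message can be written
out as complete functions and compared for behaviour. What follows is a
minimal sketch and not the code from the attachment: copy_post, copy_pre
and forms_agree are hypothetical names, and unlike strlcpy these helpers
return the number of bytes stored before the NUL rather than strlen(src).
It only shows that the two forms store the same bytes; it says nothing
about which one a given compiler turns into faster machine code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Post-increment form: use the current position, then advance. */
static size_t
copy_post(char *dst, const char *src, size_t dsize)
{
	char *odst = dst;

	if (dsize == 0)
		return 0;
	while (--dsize != 0) {
		if ((*dst++ = *src++) == '\0')
			return (size_t)(dst - odst - 1);
	}
	*dst = '\0';			/* truncated: terminate dst */
	return (size_t)(dst - odst);
}

/* Pre-increment form: advance first, so byte 0 is copied before the loop. */
static size_t
copy_pre(char *dst, const char *src, size_t dsize)
{
	char *odst = dst;

	if (dsize == 0)
		return 0;
	if (--dsize == 0) {
		*dst = '\0';		/* room for the terminator only */
		return 0;
	}
	if ((*dst = *src) == '\0')
		return 0;
	while (--dsize != 0) {
		if ((*++dst = *++src) == '\0')
			return (size_t)(dst - odst);
	}
	dst[1] = '\0';			/* truncated: terminate dst */
	return (size_t)(dst - odst + 1);
}

/* Returns 1 if both forms give the same count and the same buffer. */
static int
forms_agree(const char *src, size_t dsize)
{
	char a[64], b[64];
	size_t ra, rb;

	assert(dsize <= sizeof(a));
	memset(a, 'x', sizeof(a));
	memset(b, 'x', sizeof(b));
	ra = copy_post(a, src, dsize);
	rb = copy_pre(b, src, dsize);
	return ra == rb && memcmp(a, b, sizeof(a)) == 0;
}
```

Calling forms_agree("abcdefgh", 5) exercises the truncating path of both
functions, and forms_agree("ab", 5) the path where the source fits.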
Re: strlcpy version speed tests?
I suppose this strlcpy4 without a goto is more elegant

-Luke

On Tue, Jun 30, 2020 at 10:07 PM Luke Small wrote:

> I made it SUPER easy to test my assertion. The code is there. No
> configuration needed.
>
> On Tue, Jun 30, 2020 at 9:59 PM Theo de Raadt wrote:
>
>> Luke Small wrote:
>>
>> > So did you run the program on one of those?
>>
>> Why would I?
>>
>> i see a sales pitch
>>
>> and i go BULLSHIT
>>
>> and I'm done
>>
>> --
> -Luke
>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* cc strlcpy_test.c -pipe -O2 -o strlcpy_test && ./strlcpy_test */

/*
 * Copy string src to buffer dst of size dsize.  At most dsize-1
 * chars will be copied.  Always NUL terminates (unless dsize == 0).
 * Returns strlen(src); if retval >= dsize, truncation occurred.
 */
static size_t
strlcpy0(char *dst, const char *src, size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	/* Copy as many bytes as will fit. */
	if (nleft != 0) {
		while (--nleft != 0) {
			if ((*dst++ = *src++) == '\0')
				break;
		}
	}

	/* Not enough room in dst, add NUL and traverse rest of src. */
	if (nleft == 0) {
		if (dsize != 0)
			*dst = '\0';		/* NUL-terminate dst */
		while (*src++)
			;
	}

	return(src - osrc - 1);	/* count does not include NUL */
}

static size_t
strlcpy3(char *dst, const char *src, size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	if (nleft != 0) {
		/* Copy as many bytes as will fit. */
		while (--nleft != 0)
			if ((*dst++ = *src++) == '\0')
				return(src - osrc - 1);
		*dst = '\0';
	}

	/* Not enough room in dst, traverse rest of src. */
	while (*src++)
		;

	return(src - osrc - 1);	/* count does not include NUL */
}

static size_t
strlcpy4(char dst[], const char src[], size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	if (nleft != 0) {
		if (--nleft == 0) {
			*dst = '\0';	/* NUL-terminate dst */
			if (*src == '\0')
				return 0;
		} else {
			/* Copy as many bytes as will fit. */
			if ((*dst = *src) == '\0')
				return 0;
			while (--nleft != 0)
				if ((*++dst = *++src) == '\0')
					return(src - osrc);
			dst[1] = '\0';	/* NUL-terminate dst */
		}
	} else if (*src == '\0')
		return 0;

	/* Not enough room in dst, traverse rest of src. */
	while (*++src)
		;

	return(src - osrc);	/* count does not include NUL */
}

int
main(void)
{
	long double cpu_time_used;
	size_t y;
	struct timespec tv_start, tv_end;
	char *buffer, *buffer2;
	size_t n = 5;
	size_t m = n + 50;

	buffer = malloc(m);
	if (buffer == NULL)
		err(1, "malloc");
	buffer2 = malloc(n);
	if (buffer2 == NULL)
		err(1, "malloc");

	/* no intermediate '\0' */
	for (y = 0; y < m; ++y)
		buffer[y] = arc4random_uniform(255) + 1;
	buffer[m - 1] = '\0';

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;	/* nanoseconds to seconds */
	printf("\n\nstrlcpy\n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy0(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;
	printf("\n\nstrlcpy0\n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy3(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;
	printf("\n\nstrlcpy3\n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy4(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;
	printf("\n\nstrlcpy4\n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	return 0;
}
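The program above reads the clock around a single call to each routine.
A copy this short probably takes about as long as reading the clock
does, so one common approach is to repeat the call many times between
the two clock reads. The following is a minimal sketch of that idea and
not part of the posted program: time_calls, copy_fn and ref_strlcpy are
hypothetical names, and ref_strlcpy is only a stand-in for whichever
routine is being measured.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef size_t (*copy_fn)(char *, const char *, size_t);

/* Stand-in for the routine under test (hypothetical, memcpy-based). */
static size_t
ref_strlcpy(char *dst, const char *src, size_t dsize)
{
	size_t len = strlen(src);

	if (dsize != 0) {
		size_t n = len < dsize - 1 ? len : dsize - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* CPU seconds spent on reps calls of fn, or -1.0 if the clock fails. */
static double
time_calls(copy_fn fn, char *dst, const char *src, size_t dsize,
    unsigned long reps)
{
	struct timespec t0, t1;
	volatile size_t sink = 0;	/* discourages dropping the calls */
	unsigned long i;

	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0) == -1)
		return -1.0;
	for (i = 0; i < reps; i++)
		sink += fn(dst, src, dsize);
	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1) == -1)
		return -1.0;
	(void)sink;

	/* tv_nsec counts nanoseconds, so divide by 1e9 to get seconds. */
	return (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Dividing the returned value by reps gives an average per call. The
result still depends on the compiler, the flags and the machine, so
numbers from one system say little about another.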
Re: strlcpy version speed tests?
On 1/7/20 11:18 am, Luke Small wrote:
> I made a couple different versions if anybody is interested!

You don't need our permission…

-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
strlcpy version speed tests?
I made a couple different versions if anybody is interested!

-Luke

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* cc strlcpy_test.c -pipe -O2 -o strlcpy_test && ./strlcpy_test */

/*
 * Copy string src to buffer dst of size dsize.  At most dsize-1
 * chars will be copied.  Always NUL terminates (unless dsize == 0).
 * Returns strlen(src); if retval >= dsize, truncation occurred.
 */
static size_t
strlcpy0(char *dst, const char *src, size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	/* Copy as many bytes as will fit. */
	if (nleft != 0) {
		while (--nleft != 0) {
			if ((*dst++ = *src++) == '\0')
				break;
		}
	}

	/* Not enough room in dst, add NUL and traverse rest of src. */
	if (nleft == 0) {
		if (dsize != 0)
			*dst = '\0';		/* NUL-terminate dst */
		while (*src++)
			;
	}

	return(src - osrc - 1);	/* count does not include NUL */
}

static size_t
strlcpy3(char *dst, const char *src, size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	if (nleft != 0) {
		/* Copy as many bytes as will fit. */
		while (--nleft != 0)
			if ((*dst++ = *src++) == '\0')
				return(src - osrc - 1);
		*dst = '\0';
	}

	/* Not enough room in dst, traverse rest of src. */
	while (*src++)
		;

	return(src - osrc - 1);	/* count does not include NUL */
}

static size_t
strlcpy4(char dst[], const char src[], size_t dsize)
{
	const char *osrc = src;
	size_t nleft = dsize;

	if (nleft != 0) {
		if (--nleft == 0) {
			*dst = '\0';
			if (*src == '\0')
				return 0;
			goto strlcpy_jump;
		}

		/* Copy as many bytes as will fit. */
		if ((*dst = *src) == '\0')
			return 0;
		while (--nleft != 0)
			if ((*++dst = *++src) == '\0')
				return(src - osrc);
		dst[1] = '\0';	/* NUL-terminate dst */
	} else if (*src == '\0')
		return 0;

strlcpy_jump:
	/* Not enough room in dst, traverse rest of src. */
	while (*++src)
		;

	return(src - osrc);	/* count does not include NUL */
}

int
main(void)
{
	long double cpu_time_used;
	size_t y;
	struct timespec tv_start, tv_end;
	char *buffer, *buffer2;
	size_t n = 5;
	size_t m = n + 500;

	buffer = malloc(m);
	if (buffer == NULL)
		err(1, "malloc");
	buffer2 = malloc(n);
	if (buffer2 == NULL)
		err(1, "malloc");

	/* no intermediate '\0' */
	for (y = 0; y < m; ++y)
		buffer[y] = arc4random_uniform(255) + 1;
	buffer[m - 1] = '\0';

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;	/* nanoseconds to seconds */
	printf("\n\nstrlcpy\n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy0(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;
	printf("\n\nstrlcpy0\n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy3(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;
	printf("\n\nstrlcpy3 \n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_start);
	strlcpy4(buffer2, buffer, n);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tv_end);
	cpu_time_used =
	    (long double) (tv_end.tv_sec - tv_start.tv_sec) +
	    (long double) (tv_end.tv_nsec - tv_start.tv_nsec) /
	    (long double) 1000000000;
	printf("\n\nstrlcpy4 \n");
	printf("time = %.9Lf\n\n\n", cpu_time_used);

	return 0;
}
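Before comparing timings, it can help to confirm that a rewritten
routine still behaves like strlcpy for every buffer size, including 0
and 1. The sketch below is one way to check that, under stated
assumptions: ref_strlcpy, bad_strlcpy and agrees_for_all_sizes are
hypothetical names that do not appear in the posted program, the
reference is a plain strlen/memcpy version rather than the OpenBSD
source, and bad_strlcpy is deliberately broken to show what the check
catches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*copy_fn)(char *, const char *, size_t);

/* Reference behaviour: copy at most dsize-1 bytes, always terminate. */
static size_t
ref_strlcpy(char *dst, const char *src, size_t dsize)
{
	size_t len = strlen(src);

	if (dsize != 0) {
		size_t n = len < dsize - 1 ? len : dsize - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return len;
}

/* Deliberately broken candidate: copies but never writes the NUL. */
static size_t
bad_strlcpy(char *dst, const char *src, size_t dsize)
{
	size_t len = strlen(src);

	if (dsize != 0)
		memcpy(dst, src, len < dsize - 1 ? len : dsize - 1);
	return len;
}

/* Returns 1 if candidate matches ref_strlcpy for dsize 0 .. len+2. */
static int
agrees_for_all_sizes(copy_fn candidate, const char *src)
{
	char want[64], got[64];
	size_t dsize, len = strlen(src);

	assert(len + 2 <= sizeof(want));
	for (dsize = 0; dsize <= len + 2; dsize++) {
		/* Fill both buffers so stray or missing writes show up. */
		memset(want, 'x', sizeof(want));
		memset(got, 'x', sizeof(got));
		if (ref_strlcpy(want, src, dsize) !=
		    candidate(got, src, dsize))
			return 0;
		if (memcmp(want, got, sizeof(want)) != 0)
			return 0;
	}
	return 1;
}
```

Passing strlcpy0, strlcpy3 or strlcpy4 from the program above as the
candidate, with a few source strings of different lengths, would run
the same sweep over them.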