[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 qinzhao at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #48 from qinzhao at gcc dot gnu.org --- all the issues have been resolved. close this one as fixed.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #47 from qinzhao at gcc dot gnu.org --- all the issues triggered by the previous patch have been fixed. I am planing to close this PR as fixed.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #46 from qinzhao at gcc dot gnu.org --- https://gcc.gnu.org/viewcvs/gcc?view=revision=263028 was to fix the optimization level issue.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Jakub Jelinek changed: What|Removed |Added Target Milestone|8.2 |8.3 --- Comment #45 from Jakub Jelinek --- GCC 8.2 has been released.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #44 from Qing Zhao --- > (In reply to wilco from comment #43) will provide a simple patch for this issue.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #43 from Wilco --- (In reply to qinzhao from comment #42) > just checked in the patch for fixing the unsigned char issue as: > https://gcc.gnu.org/viewcvs/gcc?view=revision=262907 That looks it is using unsigned char accesses indeed. One more thing: the expansion happens both with -Os and -O0, and that shouldn't happen. The memcmp expansion works from -O2 onwards, which is probably a good choice for the strcmp too (a single char could be optimized with -Os but that would be the empty string).
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #42 from qinzhao at gcc dot gnu.org --- just checked in the patch for fixing the unsigned char issue as: https://gcc.gnu.org/viewcvs/gcc?view=revision=262907
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #41 from Qing Zhao --- > --- Comment #40 from Wilco --- > See eg. http://www.iso-9899.info/n1570.html section 7.24.4: > > "The sign of a nonzero value returned by the comparison functions memcmp, > strcmp, and strncmp is determined by the sign of the difference between the > values of the first pair of characters (both interpreted as unsigned char) > that > differ in the objects being compared." Thanks. I will provide a small patch to fix this issue soon.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #40 from Wilco --- (In reply to Qing Zhao from comment #39) > > --- Comment #38 from Wilco --- > > This uses signed char while the C standard says the comparison is done on > > unsigned chars. > > > > during my implementation, I did some research on whether I should use > “unsigned char” or “signed char” > for the comparison. what I checked was man page of strcmp, memcmp, (I don’t > have C standard in hand). > in the manpage of memcmp, it clearly and explicitly mentioned that the chars > are interpreted as unsigned char; > however, in the manpage of strcmp/strncmp, it’s not mentioned at all. So, I > thought that for strcmp/strncmp, > I should use signed char. but for memcmp, I used unsigned char. > > since I don’t have a C standard, could you please point me the corresponding > section for this? > thanks. See eg. http://www.iso-9899.info/n1570.html section 7.24.4: "The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared."
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #39 from Qing Zhao --- > --- Comment #38 from Wilco --- > This uses signed char while the C standard says the comparison is done on > unsigned chars. > during my implementation, I did some research on whether I should use “unsigned char” or “signed char” for the comparison. what I checked was man page of strcmp, memcmp, (I don’t have C standard in hand). in the manpage of memcmp, it clearly and explicitly mentioned that the chars are interpreted as unsigned char; however, in the manpage of strcmp/strncmp, it’s not mentioned at all. So, I thought that for strcmp/strncmp, I should use signed char. but for memcmp, I used unsigned char. since I don’t have a C standard, could you please point me the corresponding section for this? thanks.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #38 from Wilco --- (In reply to qinzhao from comment #37) > since all the implementation were in trunk. > can I close this PR now? Thanks, it generates pretty much what I expected for t1. However there is an issue: t1: ldrsb w1, [x0] subsw1, w1, #97 bne .L68 ldrsb w1, [x0, 1] .L68: mov w0, w1 ret This uses signed char while the C standard says the comparison is done on unsigned chars.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #37 from qinzhao at gcc dot gnu.org --- since all the implementation were in trunk. can I close this PR now?
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #36 from qinzhao at gcc dot gnu.org --- the 3rd part (the last part) of this PR was checked into GCC 9 today as: https://gcc.gnu.org/viewcvs/gcc?view=revision=262636
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 qinzhao at gcc dot gnu.org changed: What|Removed |Added CC||qinzhao at gcc dot gnu.org --- Comment #35 from qinzhao at gcc dot gnu.org --- the second part of the implementation is checked into gcc9 today as: https://gcc.gnu.org/viewcvs/gcc?limit_changes=0=revision=261039
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Jakub Jelinek changed: What|Removed |Added Target Milestone|8.0 |8.2 --- Comment #34 from Jakub Jelinek --- GCC 8.1 has been released.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #33 from Qing Zhao --- with SPEC2017 on X86, I didn't see any run-time performance change as expected. due to the date I collected for code size change, looks like that n=3 or n=4 might be a reasonable default. n=3 might be safer.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #32 from Qing Zhao --- Created attachment 43298 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43298=edit code size impact from inlining str(n)cmp with different n on X86
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #31 from Wilco --- (In reply to Qing Zhao from comment #30) > (in reply to Wilco from comment #29) > > > > The new test is better, however it uses i % 15 which means an expensive > > division by constant every loop iteration. It's best to change to i & 15. > > Also > > using an array of string pointers means you get something like: > > > > result += strcmp (p[i & 15], "abc"); > > > > Using this I get ~80% speedup for n=3 on AArch64, similar to your set 2. > I will try with these modification. > > > > As for benchmarking, I'm not so sure that SPEC2006 or SPEC2017 call strcmp > > with > > constant strings. > do you have any suggestion on other real applications? Not really - I haven't seen strcmp with a constant string in a benchmark other than Dhrystone (and that has a long string). What I typically do is get traces from running various benchmarks and create a microbenchmark that mimicks the behaviour, but that's probably overkill for this optimization.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #30 from Qing Zhao --- (in reply to Wilco from comment #29) > > The new test is better, however it uses i % 15 which means an expensive > division by constant every loop iteration. It's best to change to i & 15. Also > using an array of string pointers means you get something like: > > result += strcmp (p[i & 15], "abc"); > > Using this I get ~80% speedup for n=3 on AArch64, similar to your set 2. I will try with these modification. > > As for benchmarking, I'm not so sure that SPEC2006 or SPEC2017 call strcmp > with > constant strings. do you have any suggestion on other real applications?
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #29 from Wilco --- (In reply to Qing Zhao from comment #28) > > > > I don't think this is a good test. Repeatedly calling strcmp with the same > > inputs is not something real code does, especially when the string matches > > exactly. Why would it do that? The consequence is that you are probably > > optimizing for a very unlikely scenario. Instead you need to look at traces > > from real code to understand how strcmp is called in practice. Nothing else > > will give you a defensible answer. > > Please see the latest attachment to this bug, it addressed your above > concerns. > let me know if you have more concerns with the new testing cases or any > concrete > suggestion to improve the testing case. > The next step will do performance testing with SPEC2006 or SPEC2017 to > further > decide the default threshold. The new test is better, however it uses i % 15 which means an expensive division by constant every loop iteration. It's best to change to i & 15. Also using an array of string pointers means you get something like: result += strcmp (p[i & 15], "abc"); Using this I get ~80% speedup for n=3 on AArch64, similar to your set 2. As for benchmarking, I'm not so sure that SPEC2006 or SPEC2017 call strcmp with constant strings.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #28 from Qing Zhao --- > > I don't think this is a good test. Repeatedly calling strcmp with the same > inputs is not something real code does, especially when the string matches > exactly. Why would it do that? The consequence is that you are probably > optimizing for a very unlikely scenario. Instead you need to look at traces > from real code to understand how strcmp is called in practice. Nothing else > will give you a defensible answer. Please see the latest attachment to this bug, it addressed your above concerns. let me know if you have more concerns with the new testing cases or any concrete suggestion to improve the testing case. The next step will do performance testing with SPEC2006 or SPEC2017 to further decide the default threshold.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #27 from Richard Earnshaw --- (In reply to Qing Zhao from comment #23) > qinzhao@gcc116:~/Bugs/78809/const_cmp$ cat t_p.c > #include > > char array[]= "fishi"; > > #define NUM 10 > int __attribute__ ((noinline)) > cmp2 (const char *p) > { > int result = 0; > int i; > for (i=0; i< NUM; i++) { > result |= strcmp (p, "fishi"); > } > return result; > } > > int result = 0; > > int main() > { > for (int i = 0; i < 25; i++) > result += cmp2 (array); > return 0; > } I don't think this is a good test. Repeatedly calling strcmp with the same inputs is not something real code does, especially when the string matches exactly. Why would it do that? The consequence is that you are probably optimizing for a very unlikely scenario. Instead you need to look at traces from real code to understand how strcmp is called in practice. Nothing else will give you a defensible answer.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Qing Zhao changed: What|Removed |Added Attachment #42449|0 |1 is obsolete|| --- Comment #26 from Qing Zhao --- Created attachment 43222 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43222=edit the run-time performance data for inlining str(n)cmp This is the performance data with my private str(n)cmp implementation on both X86 and AARCH64 platform.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #25 from Wilco --- (In reply to Qing Zhao from comment #24) > From the above, we can see: > even with n is as big as 20, inlined version is much faster than the > non-inlined version, both on aarch64 (no hardware string compare insn > provided) and X86 (hardware string compare insn provided) > > So, it's reasonable to do the inline as much as possible. Those numbers look too good to be true. I get around 2x speed up at n=3 (using old GLIBC header which does the inlining), you get 7x on AArch64 and 21x on x86... In general it's not a good idea to inline too much because of code size bloat and branch mispredictions (a good strcmp implementation processes 8 or 16 chars at a time rather than requiring 1 branch per character). n=4 seems reasonable since you need 3 instructions per character on most targets.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #24 from Qing Zhao --- From the above, we can see: even with n is as big as 20, inlined version is much faster than the non-inlined version, both on aarch64 (no hardware string compare insn provided) and X86 (hardware string compare insn provided) So, it's reasonable to do the inline as much as possible.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #23 from Qing Zhao --- I have an implementation for the part C of this task in my private space: part C: for strcmp (s1, s2), strncmp (s1, s2, n): if the result is NOT used to do simple equality test against zero, one of "s1" or "s2" is a small constant string, n is a constant, and the Min value of the length of the constant string and "n" is smaller than a predefined threshold T, inline the call by a byte-to-byte comparision sequence to avoid calling overhead. with this implementation, I was able to measure the performance impact of the inlining transformation on different value of "n", n is the length of the string need to be compared. In order to decide the following two concerns: A. what's the default value of n. B. on a platform that support string compare hardware insns (for exmaple, X86), which should be done first for a call to strcmp/strncmp, the inline or the hardware insns? On both aarch64 and X86, I tried the following small testing case for the performance experiments: qinzhao@gcc116:~/Bugs/78809/const_cmp$ cat t_p.c #include char array[]= "fishi"; #define NUM 10 int __attribute__ ((noinline)) cmp2 (const char *p) { int result = 0; int i; for (i=0; i< NUM; i++) { result |= strcmp (p, "fishi"); } return result; } int result = 0; int main() { for (int i = 0; i < 25; i++) result += cmp2 (array); return 0; } and the option I used was: -O -fno-tree-loop-im and the corresponding option to enable or disable the added inlining, the following is the performance result: aarch64 strcmp n= 3 4 5 6 10 20 inline 31 41 62 72 114 242 no-inline 229 229 229 229 272 333 aarch64 strncmp n= 3 4 5 6 10 20 inline 41 62 62 76 125 250 no-inline 291 291 291 291 364 427 X86 strcmp n= 4 5 6 10 20 inline 21 25 31 42 163 no-inline 445 461 488 529 672 X86 strncmp n= 4 5 6 10 20 inline 21 25 28 43 77 no-inline 412 435 442 495 638 From the above, we can see:
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #22 from Qing Zhao --- https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00962.html 2nd patch
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #21 from Jeffrey A. Law --- Author: law Date: Fri Nov 17 05:32:05 2017 New Revision: 254856 URL: https://gcc.gnu.org/viewcvs?rev=254856=gcc=rev Log: 2017-11-15 Qing ZhaoPR middle-end/78809 * gimple-fold.c (gimple_fold_builtin_string_compare): Add handling of replacing call to strncmp with corresponding call to strcmp when meeting conditions. PR middle-end/78809 * gcc.dg/strcmpopt_1.c: New test. Added: trunk/gcc/testsuite/gcc.dg/strcmpopt_1.c Modified: trunk/gcc/ChangeLog trunk/gcc/gimple-fold.c trunk/gcc/testsuite/ChangeLog
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #20 from Qing Zhao --- design doc is at: https://www.mail-archive.com/gcc@gcc.gnu.org/msg83822.html
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Paolo Carlini changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed||2017-10-26 CC|qing.zhao at oracle dot com| Assignee|unassigned at gcc dot gnu.org |qing.zhao at oracle dot com Target Milestone|--- |8.0 Ever confirmed|0 |1 --- Comment #19 from Paolo Carlini --- Qing is on it.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #18 from Wilco --- (In reply to Qing Zhao from comment #17) > (In reply to Wilco from comment #16) > > >> const char s[8] = “abcd\0abc”; // null byte in the middle of the string > >> int f2(void) { return __builtin_strcmp(s, "abc") != 0; } > >> int f3(void) { return __builtin_strcmp(s, “abc”); } > >> > >> can either of the above f2 or f3 been optimized to memcmp? seems not. > > > > You never get that to the null byte as the memcmp only compares > > strlen("abc"+1) > > characters. > > Yes, this is correct for memcmp, but not for strcmp. strcmp will get to the > null byte. > as a result, > > const char s[8] = “abcd\0abc”; > strcmp (s, “abc”) != 0. // s = “abcd", which is != “abc" > strncmp (s, “abc”, 3) == 0 > memcmp(s, “abc”, 3) == 0 > > So, strcmp cannot optimized to memcmp No that should be: strcmp (s, “abc”) != 0 strncmp (s, “abc”, 4) != 0 memcmp(s, “abc”, 4) != 0 You need to compare the null terminator as well. > > However do you mean an input string which is shorter than the > > constant string? That's fine as this will compare not-equal in the memcmp. > > for the input string is shorter than the constant string, for example: > > const char s[8] = “ab\0\0abcd”; > strcmp (s, “abc”) != 0 > strncmp (s, “abc”, 3) != 0 > memcmp (s, “abc”,3) != 0 > > In a summary, since it’s NOT easy for the compiler to know where is the > “Null_terminator” > in the string, strcmp is NOT reasonable to be optimized to memcmp whenever > its result is > used to compare with zero or not. The compiler knows where the null terminator is in the constant string so it can easily figure out when it is legal as well as faster than doing a byte by byte expansion of strcmp. strcmp (s, STR) -> memcmp (s, strlen (STR) + 1) iff max(sizeof_array(s), alignment(s)) > strlen (STR). > But for strncmp, if the result is to compare with zero, it might be > reasonable to optimized it > to the corresponding memcmp, i.e > > strncmp (s, “abc”, 3) != 0 > > could be optimized to > > memcmp (s, “abc”, 3) != 0 If the strncmp size is smaller than the strlen of the constant string (and alignment is right), yes. But strncmp (s, "abc", C) is equivalent to strcmp (s, "abc") if C >= 4.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #17 from Qing Zhao --- (In reply to Wilco from comment #16) >> const char s[8] = “abcd\0abc”; // null byte in the middle of the string >> int f2(void) { return __builtin_strcmp(s, "abc") != 0; } >> int f3(void) { return __builtin_strcmp(s, “abc”); } >> >> can either of the above f2 or f3 been optimized to memcmp? seems not. > > You never get that to the null byte as the memcmp only compares > strlen("abc"+1) > characters. Yes, this is correct for memcmp, but not for strcmp. strcmp will get to the null byte. as a result, const char s[8] = “abcd\0abc”; strcmp (s, “abc”) != 0. // s = “abcd", which is != “abc" strncmp (s, “abc”, 3) == 0 memcmp(s, “abc”, 3) == 0 So, strcmp cannot optimized to memcmp > However do you mean an input string which is shorter than the > constant string? That's fine as this will compare not-equal in the memcmp. for the input string is shorter than the constant string, for example: const char s[8] = “ab\0\0abcd”; strcmp (s, “abc”) != 0 strncmp (s, “abc”, 3) != 0 memcmp (s, “abc”,3) != 0 In a summary, since it’s NOT easy for the compiler to know where is the “Null_terminator” in the string, strcmp is NOT reasonable to be optimized to memcmp whenever its result is used to compare with zero or not. But for strncmp, if the result is to compare with zero, it might be reasonable to optimized it to the corresponding memcmp, i.e strncmp (s, “abc”, 3) != 0 could be optimized to memcmp (s, “abc”, 3) != 0
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #16 from Wilco --- (In reply to Qing Zhao from comment #15) > (In reply to Wilco from comment 14) > > The only reason we have to do a character by character comparison is > > because we > > cannot read beyond the end of a string. However when we know the size or > > alignment we can safely process a string one word at a time. > > is it possible that “NULL_terminator” is in the middle of the string even > though we > know the size or alignment? for example: > > const char s[8] = “abcd\0abc”; // null byte in the middle of the string > int f2(void) { return __builtin_strcmp(s, "abc") != 0; } > int f3(void) { return __builtin_strcmp(s, “abc”); } > > can either of the above f2 or f3 been optimized to memcmp? seems not. You never get that to the null byte as the memcmp only compares strlen("abc"+1) characters. However do you mean an input string which is shorter than the constant string? That's fine as this will compare not-equal in the memcmp.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #15 from Qing Zhao --- (In reply to Wilco from comment 14) > The only reason we have to do a character by character comparison is because > we > cannot read beyond the end of a string. However when we know the size or > alignment we can safely process a string one word at a time. is it possible that “NULL_terminator” is in the middle of the string even though we know the size or alignment? for example: const char s[8] = “abcd\0abc”; // null byte in the middle of the string int f2(void) { return __builtin_strcmp(s, "abc") != 0; } int f3(void) { return __builtin_strcmp(s, “abc”); } can either of the above f2 or f3 been optimized to memcmp? seems not.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #14 from Wilco --- (In reply to Qing Zhao from comment #11) > (In reply to Wilco from comment #9) > > > str(n)cmp with a constant string can be changed into memcmp if the string > > has a > > known alignment or is an array of known size. We should check the common > > cases > > are implemented. > > Please provide an example in which a str(n)cmp with a constant string can be > changed into > memcmp. > > (From my understanding, for the strcmp (p, “fish”), since we don’t know > what will be in the string > pointed by “p”, and there might be NULL_terminator in any of the place of p, > we have to compare > each char in “p” one by one, do I miss anything here?) The only reason we have to do a character by character comparison is because we cannot read beyond the end of a string. However when we know the size or alignment we can safely process a string one word at a time. I would expect all these to optimize into memcmp - but none do: #include char s[100]; typedef struct { int x; char s[8]; } S; int f1(S *s) { return __builtin_strcmp(s->s, "abc") != 0; } int f2(void) { return __builtin_strcmp(s, "abc") != 0; } int f3(char *s) { s = __builtin_assume_aligned (s, 8); return __builtin_strcmp(s, "abc") != 0; } int f4(void) { char *s = (char*)malloc(100); return __builtin_strcmp(s, "abc") != 0; }
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #13 from Qing Zhao --- (In reply to Richard Earnshaw from comment 12) > That's not entirely correct. Notionally memcmp needs to return a value > representing the relative difference of the first different byte in the > compared areas of memory; any later bytes are irrelevant. Yes the compiler > can > compare multiple bytes at the same time and it does not have to worry about > page faulting, but it does have to keep track of where the first difference > occurs. > > Of course, the compiler can see how the result is used to optimize things > further; a simple equality test will allow the compiler to generate a simpler > sequence that could access all bytes and accumulate the overall result. Yes. this should be the reason why currently in GCC, only when the result of memcpy is compared to zero, it is optimized by “strlen” pass and “expand” pass.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #12 from Richard Earnshaw --- (In reply to Qing Zhao from comment #7) > on the other hand, memcmp will NOT early stop, it will compare exactly N > bytes of both buffers. As a result, the compiler can compare multiple bytes > at one time. > That's not entirely correct. Notionally memcmp needs to return a value representing the relative difference of the first different byte in the compared areas of memory; any later bytes are irrelevant. Yes the compiler can compare multiple bytes at the same time and it does not have to worry about page faulting, but it does have to keep track of where the first difference occurs. Of course, the compiler can see how the result is used to optimize things further; a simple equality test will allow the compiler to generate a simpler sequence that could access all bytes and accumulate the overall result.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #11 from Qing Zhao --- (In reply to Wilco from comment #9) > str(n)cmp with a constant string can be changed into memcmp if the string has > a > known alignment or is an array of known size. We should check the common cases > are implemented. Please provide an example in which a str(n)cmp with a constant string can be changed into memcmp. (From my understanding, for the strcmp (p, “fish”), since we don’t know what will be in the string pointed by “p”, and there might be NULL_terminator in any of the place of p, we have to compare each char in “p” one by one, do I miss anything here?)
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #10 from Qing Zhao --- >> From the data, we can see the inlined version of strcmp (by glibc) is much >> slower than the direct call to strcmp. (this is for size 2) >> I am using GCC farm machine gcc116: > > This result doesn't make sense - it looks like GCC is moving the strcmp call > in > the 2nd case as a loop invariant, so you're just measuring a loop with just a > subtract and orr instruction… Yes, Wilco is right here. -ftree-loop-im moves the call to strcmp out of the loop. in order to avoid this issue, I changed the options to -O -fno-tree-loop-im and checked the assembly of the routine “cmp2” for the INLINED and Non-INLINED version. Inlined version: cmp2: mov x4, x0 mov w2, 51712 movkw2, 0x3b9a, lsl 16 mov w0, 0 mov w3, 102 b .L3 .L2: neg w1, w1 orr w0, w0, w1 subsw2, w2, #1 beq .L5 .L3: ldrbw1, [x4] subsw1, w3, w1 bne .L2 ldrbw1, [x4, 1] neg w1, w1 b .L2 .L5: ret Non-inlined version: cmp2: stp x29, x30, [sp, -48]! add x29, sp, 0 stp x19, x20, [sp, 16] stp x21, x22, [sp, 32] mov x22, x0 mov w19, 51712 movkw19, 0x3b9a, lsl 16 mov w20, 0 adrpx21, .LC0 add x21, x21, :lo12:.LC0 .L2: mov x1, x21 mov x0, x22 bl strcmp orr w20, w20, w0 subsw19, w19, #1 bne .L2 mov w0, w20 ldp x19, x20, [sp, 16] ldp x21, x22, [sp, 32] ldp x29, x30, [sp], 48 ret Then, the run-time performance data is: qinzhao@gcc116:~/Bugs/78809/const_cmp/perf$ sh t_p /home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c -DINLINED inlined version 34.73user 0.00system 0:34.73elapsed 99%CPU (0avgtext+0avgdata 360maxresident)k 0inputs+0outputs (0major+135minor)pagefaults 0swaps /home/qinzhao/Install/latest/bin/gcc -O -fno-tree-loop-im t_p_1.c t_p.c non-inlined version 138.79user 0.00system 2:18.77elapsed 100%CPU (0avgtext+0avgdata 356maxresident)k 0inputs+0outputs (0major+135minor)pagefaults 0swaps Yes, looks like that the inlined version is much faster than the non-inlined version on aarch64 platform.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #9 from Wilco --- (In reply to Qing Zhao from comment #7) str(n)cmp with a constant string can be changed into memcmp if the string has a known alignment or is an array of known size. We should check the common cases are implemented. Given it takes just 3 instructions per character on almost all targets, it is reasonable to inline strcmp with a small string if it cannot be changed into memcmp. This will be faster than calling strcmp due to the overhead of the call, the plt redirection and alignment checks. Note also that many strcmp implementations will do a byte-by-byte check if the strings are not aligned.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #8 from Wilco --- > /home/qinzhao/Install/latest/bin/gcc -O2 t_p_1.c t_p.c > non-inlined version > 20.84user 0.00system 0:20.83elapsed 100%CPU (0avgtext+0avgdata > 360maxresident)k > 0inputs+0outputs (0major+135minor)pagefaults 0swaps > > From the data, we can see the inlined version of strcmp (by glibc) is much > slower than the direct call to strcmp. (this is for size 2) > I am using GCC farm machine gcc116: This result doesn't make sense - it looks like GCC is moving the strcmp call in the 2nd case as a loop invariant, so you're just measuring a loop with just a subtract and orr instruction...
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #7 from Qing Zhao --- I have studied the inlining of memcmp and str(n)cmp in GCC, the following are a summary of my study so far: 1. memcmp is different from str(n)cmp as the following: • strcmp compares null-terminated C strings • strncmp compares at most N characters of null-terminated C strings • memcmp compares binary byte buffers of N bytes. among these three, both strcmp and strncmp might early stop at NULL terminator of the compared strings. as a result, we have to compare the char one by one for str(n)cmp. on the other hand, memcmp will NOT early stop, it will compare exactly N bytes of both buffers. As a result, the compiler can compare multiple bytes at one time. So, memcpy should be much easier for the compiler to optimize than str(n)cmp. 2. when the memcpy’s result is only compared against zero, it's easier to be optimized than its result compared with other values. 3. the implementation of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52171 have been in latest GCC, and it optimizes the following cases on ALL platforms: memcmp (p, "fishi", 5) != 0 __builtin_memcmp (p, "fish", 8) == 0 and the implementation is a target independent one. when the length of constant string is multiple word size, it is optimized by “strlen” pass, when it’s NOT multiple word size, it is optimized by “expand” pass. 4. However, this implementation to PR52171 does NOT optimize the following cases: A. memcmp’s result is NOT compared to zero memcmp (p, "fish", 4); __builtin_memcmp (p, “fishi”, 5); B. strncmp and strcmp, when the result is or is NOT compared to zero strncmp (p, "fi", 2) != 0 __builtin_strncmp (p, "fi", 2) != 0 strcmp (p, "fish") != 0 __builtin_strcmp (p, "fish") != 0 strncmp (p, "fi", 2) __builtin_strncmp (p, "fi", 2) strcmp (p, "fish”) __builtin_strcmp (p, "fish”) Per the reason I mentioned in 1. strncmp and strcmp can ONLY compare the char one by one, the implementation of PR52171 can NOT be extended for str(n)cmp. 5. currently, glibc optimizes strcmp with constant string up to size 3 in the header as Wilco mentioned in the Description part of the PR. the optimization is as following: strcmp (p, “fis”) will be transformed to: D.2234 = __s2_len = 3;, __s2_len <= 3; ? TARGET_EXPR1 && __result == 0) { __result = (int) *(__s1 + 2) - (int) *((const unsigned char *) "fis" + 2); if (__s2_len > 2 && __result == 0) { __result = (int) *(__s1 + 3) - (int) *((const unsigned char *) "fis" + 3); } } } } D.2233 = __result; }> : __builtin_strcmp ((const char *) p, (const char *) "fis"); I.e, each character is compared one by one, and the result of the previous comparison is used to control whether the next char should be compared or not, as a result, comparing of one char needs one load, one compare and one branch. not very cheap operations. therefore, we cannot apply the above optimization on longer strings even though its constant string. The Request of this PR is to move the above transformation from Glibc into GCC. However, my run-time performance data showed in Comment 5 and 6, shows: the glibc optimization is SLOWER than the direct call to strcmp on aarch64. 6. On the other hand, on a platform that has hardward strcmp insn, for example, X86 platform, expand the str(n)cmp to use the hardware strcmp insn will be very fast. GCC currently has done this for several platforms. However, for a platform that does NOT have hardware strcmp insns, for example, aarch64, inline str(n)cmp might NOT be a good idea, I think.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #6 from Qing Zhao --- (A correction to comment 4: when adding #include , the call to strcmp(s, "a") was inlined by glibc in the header with the latest upstream GCC ) with attached testing case, on GCC farm machine gcc116, I got the following run-time performance data: qinzhao@gcc116:~/Bugs/78809/const_cmp$ sh t_p /home/qinzhao/Install/latest/bin/gcc -O2 t_p_1.c t_p.c -DINLINED inlined version 41.67user 0.00system 0:41.67elapsed 99%CPU (0avgtext+0avgdata 360maxresident)k 0inputs+0outputs (0major+135minor)pagefaults 0swaps /home/qinzhao/Install/latest/bin/gcc -O2 t_p_1.c t_p.c non-inlined version 20.84user 0.00system 0:20.83elapsed 100%CPU (0avgtext+0avgdata 360maxresident)k 0inputs+0outputs (0major+135minor)pagefaults 0swaps >From the data, we can see the inlined version of strcmp (by glibc) is much slower than the direct call to strcmp. (this is for size 2) I am using GCC farm machine gcc116: qinzhao@gcc116:~/Bugs/78809/const_cmp$ uname -a Linux gcc116 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:45:34 UTC 2016 aarch64 aarch64 aarch64 GNU/Linux
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #5 from Qing Zhao --- Created attachment 42449 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42449=edit Run time performance testing case on aarch64 this is for testing the run-time performance of inlined version of strcmp and non-inlined version of strcmp on aarch64. to run the test: 0. on an aarch64 machine; 1. untar the tar file 2. cd perf_78809 3. sh t_p
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #4 from Wilco --- (In reply to Qing Zhao from comment #3) > with the latest upstream GCC, we got the following assembly for the testing > case with -O2 on aarch64: > > t1: > adrpx1, .LC0 > add x1, x1, :lo12:.LC0 > b strcmp > > t2: > adrpx1, .LC0 > add x1, x1, :lo12:.LC0 > b strcmp Yes the inlining was recently removed from GLIBC since the goal is to inline in GCC.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Qing Zhao changed: What|Removed |Added CC||qing.zhao at oracle dot com --- Comment #3 from Qing Zhao --- with the latest upstream GCC, we got the following assembly for the testing case with -O2 on aarch64: t1: adrpx1, .LC0 add x1, x1, :lo12:.LC0 b strcmp t2: adrpx1, .LC0 add x1, x1, :lo12:.LC0 b strcmp As I checked on X86, the same testing case has the following assembly with -O2: t1: movq%rdi, %rsi movl$2, %ecx movl$.LC0, %edi repz; cmpsb seta%al setb%dl subl%edx, %eax movsbl %al, %eax ret t2: movq%rdi, %rsi movl$2, %ecx movl$.LC0, %edi repz; cmpsb seta%al setb%dl subl%edx, %eax movsbl %al, %eax ret on X86, both routines have the exact assembly and both are inlined by GCC.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 --- Comment #2 from wilco at gcc dot gnu.org --- (In reply to Richard Biener from comment #1) > We may have dups of this. And we now have inlining for strcmp/memcmp when > the > result is only compared against zero. I don't see that happening with latest trunk (maybe patches are still outstanding?), however if so then that code could be generalized to support any strcmp.
[Bug middle-end/78809] Inline strcmp with small constant strings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78809 Richard Biener changed: What|Removed |Added Keywords||missed-optimization --- Comment #1 from Richard Biener --- We may have dups of this. And we now have inlining for strcmp/memcmp when the result is only compared against zero.