[pcre-dev] [Bug 2527] Incomplete unicode handling in pcre2_substitute when converting to upper/lower case
https://bugs.exim.org/show_bug.cgi?id=2527

--- Comment #5 from Philip Hazel ---

I have committed (r1224) code that uses Unicode character properties for casing operations when PCRE2_UCP is set, whether or not PCRE2_UTF is used. This currently applies only to the interpreter (and substitute), but Zoltan will update the JIT in due course. Also, I have not yet updated the documentation (my next task).

If you are able to test this code, please do. I think I found all the relevant places in the code, but it's quite possible I missed something.

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev

Re: [pcre-dev] Question regarding regex complexity, catastrophic backtrack and jit/no_jit
> Matching with jit, it was very easy to produce an example which
> exceeds the available resources: We take the pattern
> "(*LIMIT_MATCH=10)(x+x+x+x+)+y" and as subject we take a string of
> length 10 containing only the letter "x".

Philip summarizes this well. In the case of your example, 5 x-s are enough to reach the backtracking limit in both the interpreter and the JIT:

  re> /(*LIMIT_MATCH=10)(x+x+x+x+)+y/
data> xay
Failed: error -47: match limit exceeded

The JIT also searches for the required character, but only if the input is shorter than 5000 bytes (perhaps this should be characters). The problem with this optimization is that it is a performance overhead if the character is found. A while ago I measured real-world examples with this optimization enabled and disabled, and it seemed better not to do it. But perhaps with SIMD support this could be improved in the JIT as well.

Regards,
Zoltan
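
To illustrate why a pattern like (x+x+x+x+)+y hits a match limit so quickly on a subject made only of x-s, here is a toy backtracking matcher in Python. It is only a sketch: it hard-codes this one pattern, and its step counter is not the quantity PCRE2 counts for LIMIT_MATCH, so the numbers will not agree with pcre2test. The names toy_match and MatchLimitExceeded are made up for this example.

```python
class MatchLimitExceeded(Exception):
    """Raised when the toy matcher exceeds its step budget."""


def toy_match(subject, limit=None):
    """Naive backtracking match of (x+x+x+x+)+y, unanchored.

    Returns (matched, steps). This is a toy model, not PCRE2's algorithm;
    'steps' only counts calls to the inner helper.
    """
    steps = 0
    n = len(subject)

    def attempt(pos, remaining):
        # 'remaining' is the number of x+ items still to match in the
        # current pass through the group.
        nonlocal steps
        steps += 1
        if limit is not None and steps > limit:
            raise MatchLimitExceeded(f"more than {limit} steps")
        if remaining == 0:
            # One pass through the group is complete: greedily try another
            # pass first, then fall back to the trailing 'y'.
            if attempt(pos, 4):
                return True
            return pos < n and subject[pos] == "y"
        # x+ is greedy: find the longest run of x, then give back one
        # character at a time on failure.
        end = pos
        while end < n and subject[end] == "x":
            end += 1
        for stop in range(end, pos, -1):
            if attempt(stop, remaining - 1):
                return True
        return False

    for start in range(n):
        if attempt(start, 4):
            return True, steps
    return False, steps


if __name__ == "__main__":
    # The step count grows rapidly with the number of x-s when no 'y' follows.
    for length in (4, 8, 12):
        matched, steps = toy_match("x" * length)
        print(length, matched, steps)
    try:
        toy_match("x" * 10, limit=10)
    except MatchLimitExceeded as exc:
        print("limit hit:", exc)
```

Every way of splitting the run of x-s among the four x+ items and the outer repeat has to be tried before the matcher can conclude that no y follows, which is why even a small limit is exhausted after a handful of characters.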