[pcre-dev] [Bug 2527] Incomplete unicode handling in pcre2_substitute when converting to upper/lower case

2020-02-23 Thread admin
https://bugs.exim.org/show_bug.cgi?id=2527

--- Comment #5 from Philip Hazel  ---
I have committed (r1224) code that uses Unicode character properties for casing
operations when PCRE2_UCP is set, whether or not PCRE2_UTF is used. This
currently applies only to the interpreter (and substitute), but Zoltan will
update the JIT in due course. Also, I have not yet updated the documentation
(my next task). If you are able to test this code, please do. I think I found
all the relevant places in the code, but it's quite possible I missed
something.
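
For readers unfamiliar with the distinction, here is a minimal Python sketch of the difference between ASCII-only casing and casing driven by Unicode character properties. It is an analogy only, not PCRE2 code: the function names are hypothetical, and Python's `str.upper()` stands in for a Unicode property lookup. Multi-character mappings are left unchanged here purely to keep the sketch one-to-one.

```python
def ascii_upper(text: str) -> str:
    """Upper-case only the ASCII letters a-z; leave everything else untouched."""
    return "".join(
        chr(ord(c) - 32) if "a" <= c <= "z" else c
        for c in text
    )


def property_upper(text: str) -> str:
    """Upper-case using Unicode case data, one code point at a time.

    Mappings that expand to more than one code point are skipped, which is
    an assumption of this sketch and not a statement about PCRE2.
    """
    out = []
    for c in text:
        u = c.upper()
        out.append(u if len(u) == 1 else c)
    return "".join(out)


if __name__ == "__main__":
    word = "caf\u00e9"  # "cafe" with e-acute as a single code point
    print(ascii_upper(word))     # the accented letter keeps its lower-case form
    print(property_upper(word))  # the accented letter is upper-cased too
```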

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Question regarding regex complexity, catastrophic backtrack and jit/no_jit

2020-02-23 Thread Zoltán Herczeg
> Matching with jit, it was very easy to produce an example which
> exceeds the available resources: We take the pattern
> "(*LIMIT_MATCH=10)(x+x+x+x+)+y" and as subject we take a string of
> length 10 containing only the letter "x".

Philip summarizes this well. In the case of your example, five x-s are enough to
reach the backtracking limit in both the interpreter and JIT:

  re> /(*LIMIT_MATCH=10)(x+x+x+x+)+y/
data> xay
Failed: error -47: match limit exceeded
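
To see why so few x-s are enough, here is a toy backtracking matcher in Python for this one pattern, with a step counter that plays the role of the match limit. It is only a sketch: the names (`matches`, `MatchLimitExceeded`) are hypothetical, and the way it counts steps is an assumption that does not reproduce how the PCRE2 interpreter or JIT count.

```python
class MatchLimitExceeded(Exception):
    """Raised when the toy matcher uses more steps than allowed."""


def matches(subject: str, limit: int) -> bool:
    """Search subject for (x+x+x+x+)+y by naive backtracking.

    Every call to an internal matching function counts as one step;
    exceeding `limit` steps raises MatchLimitExceeded.
    """
    steps = 0

    def tick() -> None:
        nonlocal steps
        steps += 1
        if steps > limit:
            raise MatchLimitExceeded(f"more than {limit} steps")

    def x_plus(pos: int, remaining: int) -> bool:
        # Match one greedy "x+" at pos; `remaining` counts the x+ items
        # still to match in the current group iteration, this one included.
        tick()
        end = pos
        while end < len(subject) and subject[end] == "x":
            end += 1
        # Try the longest run first, then give back one x at a time.
        for stop in range(end, pos, -1):
            if remaining > 1:
                if x_plus(stop, remaining - 1):
                    return True
            elif after_group(stop):
                return True
        return False

    def after_group(pos: int) -> bool:
        # After one full group iteration: try another iteration first
        # (the outer + is greedy), then fall back to the literal y.
        tick()
        if x_plus(pos, 4):
            return True
        return pos < len(subject) and subject[pos] == "y"

    # Unanchored search: try every starting offset.
    return any(x_plus(start, 4) for start in range(len(subject)))


if __name__ == "__main__":
    print(matches("xxxxy", 1_000_000))  # a match exists
    try:
        matches("xxxxx", 10)
    except MatchLimitExceeded as exc:
        print("failed:", exc)
```

With no y in the subject, every way of splitting the x-s among the four x+ items has to be tried before the match can fail, so a budget of 10 steps runs out almost immediately.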

JIT also searches for the required character, but only if the input is shorter
than 5000 bytes (perhaps this should be characters). The problem with this
optimization is that it adds performance overhead when the character is found.
A while ago I measured real-world examples with this optimization enabled and
disabled, and it seemed better not to do it. But perhaps with SIMD support this
could be improved in JIT as well.
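
As a rough illustration of the required-character check, here is a minimal Python sketch. The 5000-byte threshold is the figure mentioned above; the function name, the constant name and the `matcher` callback are hypothetical and do not correspond to PCRE2 internals.

```python
REQUIRED_CHAR_SCAN_LIMIT = 5000  # bytes; figure taken from the message above


def match_with_required_char(subject: bytes, required: bytes, matcher) -> bool:
    """Run `matcher` only if the required byte can possibly be present.

    For short subjects, scan for `required` first. If it is absent, no
    match is possible and the potentially expensive matcher is skipped.
    Long subjects skip the scan and go straight to the matcher.
    """
    if len(subject) < REQUIRED_CHAR_SCAN_LIMIT and required not in subject:
        return False
    return bool(matcher(subject))


if __name__ == "__main__":
    import re

    pattern = re.compile(rb"(x+x+x+x+)+y")
    # No "y" in the subject: rejected by the scan, the regex never runs.
    print(match_with_required_char(b"x" * 30, b"y", pattern.search))
    # "y" is present: the scan was extra work, then the regex runs anyway.
    print(match_with_required_char(b"xxxxy", b"y", pattern.search))
```

The second call shows the trade-off: when the character is found, the scan is pure overhead on top of the match itself.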

Regards,
Zoltan
 