Re: [pcre-dev] Support for invalid UTF-8 strings?

2018-04-13 Thread Milan Bouchet-Valat
Hi,

Thanks for the detailed reply, that's very useful. To be honest, I
won't work on implementing this myself, but it's important to know
what's possible to implement when designing APIs.

I think it would be OK for Julia to check whether a string is valid
UTF-8 beforehand (as PCRE currently does), and fall back to a slow path
if it's not. Of course the slow path shouldn't make the standard path
slower, and ideally code duplication would be limited, which might not
be easy. Or maybe the string could be made valid before passing it to
PCRE, replacing invalid sequences with special characters which could
then be reintroduced in the matches in the corresponding positions.
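
A minimal sketch of that second idea, for illustration only. It uses
Python's standard `re` module as a stand-in for PCRE, and the names
PLACEHOLDER, sanitize and search_bytes are hypothetical. Each invalid
byte is replaced by U+FFFD (an assumption; any agreed placeholder would
do), and a table of byte offsets lets a match be mapped back onto the
original, unmodified bytes.

```python
import re

PLACEHOLDER = "\ufffd"  # assumption: one U+FFFD stands in for each invalid byte


def sanitize(data: bytes):
    """Return (text, offsets).

    text is valid Unicode; offsets[i] is the byte offset in data where
    character i starts, and offsets[-1] == len(data).
    """
    chars, offsets = [], []
    pos = 0
    while pos < len(data):
        try:
            chunk = data[pos:].decode("utf-8")
            bad_len = 0
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice being decoded.
            chunk = data[pos:pos + err.start].decode("utf-8")
            bad_len = err.end - err.start
        for ch in chunk:
            chars.append(ch)
            offsets.append(pos)
            pos += len(ch.encode("utf-8"))
        for _ in range(bad_len):
            chars.append(PLACEHOLDER)
            offsets.append(pos)
            pos += 1
    offsets.append(len(data))
    return "".join(chars), offsets


def search_bytes(pattern: str, data: bytes):
    """Return the matched slice of the original bytes, or None."""
    try:
        text = data.decode("utf-8")  # fast path: strict validation only
    except UnicodeDecodeError:
        text, offsets = sanitize(data)  # slow path: only for invalid input
        m = re.search(pattern, text)
        if m is None:
            return None
        return data[offsets[m.start()]:offsets[m.end()]]
    m = re.search(pattern, text)
    return None if m is None else m.group().encode("utf-8")
```

With this, search_bytes(r"b.c", b"ab\xffcd") returns the original bytes
b"b\xffc". One limitation of this choice: the pattern cannot tell a
placeholder from a genuine U+FFFD in the input.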

For now I guess we should require strings to be valid, and if somebody
is able to implement this later we can always remove this requirement,
as it wouldn't be breaking.

Thanks for your help
On Friday 13 April 2018 at 16:07 +0100, p...@hermes.cam.ac.uk wrote:
> On Thu, 12 Apr 2018, Milan Bouchet-Valat wrote:
> 
> > I'm writing on behalf of the Julia programming language [1]
> > developers in order to get some information regarding the handling
> > of invalid UTF-8 strings when PCRE2_UTF and PCRE2_NO_UTF_CHECK flags
> > are set.
> 
> Milan,
> 
> I understand what you are suggesting (treating invalid UTF-8 as
> one-byte characters) because I have implemented exactly that in other
> software I've written where performance is not critical.
> 
> However, in regex matching, performance *is* critical, which is why
> PCRE insists on working only with valid UTF strings. Checking each
> sequence for validity each time a character was inspected would
> degrade performance. (Also, in a backtracking algorithm, the same
> character may be inspected multiple times during the course of a
> match, which only makes matters worse.)
> 
> The code in the PCRE2 library that checks a UTF-8 string for validity
> is non-trivial. (It's in the source file src/pcre2_valid_utf.c if you
> want to take a look.) Admittedly, it does identify very specific
> errors in invalid sequences, but, for example, checking a 3-byte
> sequence involves seven "if" tests of various kinds plus a switch and
> a table lookup. (That's from a quick visual scan of the code; hope I
> counted right.) Ignoring some of the less serious errors (overlong
> sequences or surrogate codes) would simplify this a bit, but not much.
> 
> My view on this has always been that the most efficient approach, in
> the sense of getting the "best" (in some sense) behaviour over all
> applications, is for applications to handle non-standard character
> strings external to PCRE so that it can work as efficiently as
> possible. One possible approach for strings of unknown provenance is
> to run without PCRE2_NO_UTF_CHECK and, if any of the "invalid UTF"
> errors occur, to convert the string (according to whatever rules you
> want) into a valid UTF-8 string and then try again.
> 
> > Do you think such a behavior would make sense? Could it be
> > implemented without dramatically impacting performance? Julia could
> > use a custom patch if this feature is not deemed useful for PCRE.
> 
> It certainly makes sense, but I don't think it could be implemented
> without a serious performance hit. If you want to hack and try, note
> that the macros whose names start with GETCHAR (in
> pcre2_intmodedep.h) are used for character handling. In the case of
> UTF-8 these make use of GETUTF8, GETUTF8INC, and GETUTF8LEN, which are
> defined in pcre2_internal.h. However, there are also BACKCHAR,
> FORWARDCHAR, and ACROSSCHAR for moving around. These macros are used
> for compilation as well as for matching by the interpreter functions
> pcre2_match() and pcre2_dfa_match(). I don't know what happens in the
> JIT matcher, as I do not maintain that code, but it too assumes valid
> UTF-8. To be honest, I don't really advise trying to hack in this way.
> I think it makes more sense to fix bad strings externally.
> 
> Philip
> 
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


[pcre-dev] Support for invalid UTF-8 strings?

2018-04-12 Thread Milan Bouchet-Valat
Dear PCRE developers,

I'm writing on behalf of the Julia programming language [1] developers
in order to get some information regarding the handling of invalid
UTF-8 strings when PCRE2_UTF and PCRE2_NO_UTF_CHECK flags are set. For
context, Julia has taken the stance that strings are stored in UTF-8
but are not required to contain valid UTF-8. This is needed to be able
to work with any contents, like a filename or an invalid text file,
without throwing errors or modifying the input data. This approach is
similar to that adopted by Go (see [2] for an example of problems which
arise when strings are required to be valid Unicode as in Python 3).

Of course, this stance is more complex to hold in the context of
regular expression matching. The PCRE documentation very clearly states
that both the regular expression and the string must be valid Unicode
when PCRE2_UTF is set, and that behavior is undefined if that's not the
case and PCRE2_NO_UTF_CHECK is also set. However, we have been
wondering whether it would be possible to allow the string (not the
regex) to contain invalid UTF-8 when PCRE2_NO_UTF_CHECK is set. In such
a situation, invalid sequences would simply be treated as series of
one-byte "characters" for which all Unicode predicates would be false,
and returned as-is (see [3]). This is how Julia treats invalid UTF-8
strings and it appears to work well. By default, valid UTF-8 would
still be required, but instead of declaring the behavior as undefined
when the string is invalid and PCRE2_NO_UTF_CHECK is set, a
well-defined behavior would be implemented.
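
The proposed semantics can be illustrated outside PCRE. The sketch
below is only an analogy: it uses Python's `re` module and the
`surrogateescape` error handler (PEP 383), which maps each undecodable
byte to a lone surrogate code point. Every invalid byte then behaves as
one "character" that is neither a word character nor whitespace, and
encoding with the same handler restores the original bytes. It says
nothing about how PCRE itself would implement this.

```python
import re

raw = b"caf\xc3\xa9 \xff\xfe end"  # valid UTF-8 plus two invalid bytes

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Unicode predicates are false for the escaped bytes: \w skips them.
words = re.findall(r"\w+", text)

# "." still matches them, one "character" per invalid byte.
m = re.search(r" (..) ", text)

# Re-encoding with the same handler returns the bytes as-is.
recovered = m.group(1).encode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

Here words contains only the two valid words, recovered is b"\xff\xfe",
and roundtrip equals raw.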

Let me stress that we do not suggest supporting invalid regexes, as it
appears difficult to give them a clear and meaningful definition. We
are also aware that we could avoid setting PCRE2_UTF, but the resulting
behavior would not match what is generally expected for strings which
are supposed to contain (possibly invalid) Unicode text.
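
To make the last point concrete, here is a small sketch of the
difference between byte-oriented and character-oriented matching. It
uses Python's `re` module purely as an analogy for running PCRE without
and with PCRE2_UTF; the variable names are illustrative.

```python
import re

data = "caf\u00e9".encode("utf-8")  # b"caf\xc3\xa9": the e-acute takes two bytes

# Byte mode: "." matches one byte and \w only knows ASCII letters.
byte_dots = re.findall(rb".", data)
byte_words = re.findall(rb"\w+", data)

# Character mode: "." matches one code point and \w is Unicode-aware.
char_dots = re.findall(r".", data.decode("utf-8"))
char_words = re.findall(r"\w+", data.decode("utf-8"))
```

In byte mode the accented letter is split into two units and dropped
from the word; in character mode it is a single letter.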

Do you think such a behavior would make sense? Could it be implemented
without dramatically impacting performance? Julia could use a custom
patch if this feature is not deemed useful for PCRE.

Thanks in advance for your help


1: http://julialang.org
2: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
3: https://github.com/JuliaLang/julia/pull/26731#issuecomment-379580049