Re: [pcre-dev] Support for invalid UTF-8 strings?

2018-04-14 Thread Zoltán Herczeg
Hi,

there are many problems with invalid UTF-8 strings. Here are some thoughts from 
previous discussions:

- What is the type of an invalid character? Is 0xe9 a Latin é letter, or something 
else?
- What is the lowercase/uppercase pair of an invalid character?
- Moving around: if you start a match from the middle of a codepoint, and you 
move 1 char forward and 1 char backward, where will you be?
- Ranges: does [a-\xff] match 0xe9?
- What happens with \xd800-\xdfff? They are valid codepoints but invalid 
characters. Does [a-\x] match them or not?
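
To make the first and last questions concrete, here is a small Python illustration (not PCRE code; Python's codecs are only a stand-in) of how the same byte 0xe9 reads under different assumptions, and of how strict UTF-8 treats surrogates:

```python
# Illustrative only: the byte 0xE9 under different assumptions.
raw = b"caf\xe9"                       # 0xE9 is 'é' in Latin-1

as_latin1 = raw.decode("latin-1")      # every byte maps to a code point

try:
    raw.decode("utf-8")                # strict decoding rejects the lone 0xE9
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# In UTF-8, 0xE9 is the lead byte of a 3-byte sequence, for example U+9000.
three_byte = b"\xe9\x80\x80".decode("utf-8")

# Surrogates (U+D800..U+DFFF) are code points, but strict UTF-8 refuses them.
try:
    "\ud800".encode("utf-8")
    surrogate_ok = True
except UnicodeEncodeError:
    surrogate_ok = False
```

Whether an invalid 0xe9 should behave like é, like a fragment of a longer character, or like nothing at all is exactly the policy question the list above raises.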

Everyone has a different idea about what should happen in these cases 
(depending on their own use case). Perhaps the most consistent idea was 
that all invalid bytes have the same type and do not match any character range. 
Still, it is costly to detect invalid character fragments (constantly checking 
the subject length, plus other checks), and backchar is a nightmare.
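
A minimal sketch of that "most consistent" policy, assuming Python as a stand-in: the `surrogateescape` error handler maps each undecodable byte 0xNN to the lone surrogate U+DCNN, which can serve as a single "invalid" type. The helper names are hypothetical and nothing here is PCRE code.

```python
def decode_marking_invalid(raw: bytes) -> str:
    # Each undecodable byte 0xNN becomes the marker code point U+DCNN.
    return raw.decode("utf-8", errors="surrogateescape")

def is_invalid_marker(ch: str) -> bool:
    # True for the markers produced by surrogateescape (U+DC80..U+DCFF).
    return 0xDC80 <= ord(ch) <= 0xDCFF

def in_range(ch: str, lo: str, hi: str) -> bool:
    # Policy from the text: an invalid byte never matches any range,
    # even one that would numerically include its marker value.
    if is_invalid_marker(ch):
        return False
    return lo <= ch <= hi

text = decode_marking_invalid(b"a\xe9z")
```

With this policy the middle element of `text` is rejected even by a range as wide as `a` to U+10FFFF, while `z` still matches. The cost noted above remains: every character read has to go through the extra check.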

Implementation perspective: unfortunately PCRE (and the JIT) uses more than just 
a readchar() function; we have added many special cases (especially in the JIT) 
to scan characters fast. (E.g. we know that [a-z] matches only a single byte in 
UTF-8, so we discard multibyte UTF-8 characters without decoding their values, 
which is faster than calling readchar.) You would need to go over all the 
optimizations scattered around the code, and that is far from trivial.
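
The kind of shortcut described above can be sketched roughly as follows. This is an illustration in Python, not the PCRE or JIT code; the function name is hypothetical and the subject is assumed to be valid UTF-8.

```python
def scan_ascii_class(subject: bytes, pos: int, lo: int, hi: int):
    """Try to match one character of an ASCII-only class [lo-hi] at pos.

    Returns (matched, next_pos). Assumes subject is valid UTF-8 and
    pos is at the start of a character.
    """
    b = subject[pos]
    if b < 0x80:
        # Single-byte character: compare the byte directly.
        return (lo <= b <= hi, pos + 1)
    # Multibyte character: it cannot be in an ASCII-only class, so skip
    # its continuation bytes (10xxxxxx) without decoding the value.
    pos += 1
    while pos < len(subject) and (subject[pos] & 0xC0) == 0x80:
        pos += 1
    return (False, pos)

subject = "aéz".encode("utf-8")    # b"a\xc3\xa9z"
```

On invalid input the skip loop has no sound meaning: a run of stray continuation bytes such as b"\x80\x80" would be consumed as if it were one character. Every shortcut of this kind would need revisiting.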

Regards,
Zoltan
 
 Original message 
From: Milan Bouchet-Valat < nalimi...@club.fr >
Date: 13 April 2018 17:23:04
Subject: Re: [pcre-dev] Support for invalid UTF-8 strings?
To: pcre-dev@exim.org
 
Hi,
Thanks for the detailed reply, that's very useful. To be honest, I
won't work on implementing this myself, but it's important to know
what's possible to implement when designing APIs.
I think it would be OK for Julia to check whether a string is valid
UTF-8 beforehand (as PCRE currently does), and fall back to a slow path
if it's not. Of course the slow path shouldn't make the standard path
slower, and ideally code duplication would be limited, which might not
be easy. Or maybe the string could be made valid before passing it to
PCRE, replacing invalid sequences with special characters which could
then be reintroduced in the matches in the corresponding positions.
For now I guess we should require strings to be valid, and if somebody
is able to implement this later we can always remove this requirement,
as it wouldn't be breaking.
Thanks for your help

Re: [pcre-dev] Support for invalid UTF-8 strings?

2018-04-13 Thread Ze'ev Atlas via Pcre-dev
If you want to hide the gory details from the user, then you may provide two 
methods with similar signatures. One would handle known-valid UTF-8 strings 
and the other would handle suspect strings. The first one would go straight 
to PCRE2, and the other would do the suggested verification and conversion 
behind the scenes and go to PCRE2 with the internally modified string, and 
perhaps a modified regex as well.

You may leave the second one as a stub that calls the first one for now, until 
you get around to implementing that verification/mutation algorithm.

I took this approach for something less complicated in my port of PCRE2. 
Actually, I am making the decision myself, so there is only one method that 
the user sees, but my problem was less complicated (all my characters are 8 
bits, period, and I use a standardized tool to identify and remedy the 
problem).

Oh, and the character set(s) is/are not ASCII; it's from a parallel world, 
from a galaxy far far away :)

Ze'ev
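
A rough sketch of the two-method shape, with the second method left as a stub as suggested. Python's `re` module stands in for PCRE2 here, and both function names are hypothetical.

```python
import re

def search_valid(pattern: str, subject: bytes):
    """For subjects the caller knows to be valid UTF-8: go straight to the matcher."""
    return re.search(pattern, subject.decode("utf-8"))

def search_untrusted(pattern: str, subject: bytes):
    """For suspect subjects. Stub for now: it simply delegates.

    Verification and conversion of the subject (and perhaps the pattern)
    can be added here later without changing the signature.
    """
    return search_valid(pattern, subject)
```

Because both functions share a signature, callers written against `search_untrusted` today would not need to change when the verification step is filled in.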


Re: [pcre-dev] Support for invalid UTF-8 strings?

2018-04-13 Thread Milan Bouchet-Valat
Hi,
Thanks for the detailed reply, that's very useful. To be honest, I
won't work on implementing this myself, but it's important to know
what's possible to implement when designing APIs.
I think it would be OK for Julia to check whether a string is valid
UTF-8 beforehand (as PCRE currently does), and fall back to a slow path
if it's not. Of course the slow path shouldn't make the standard path
slower, and ideally code duplication would be limited, which might not
be easy. Or maybe the string could be made valid before passing it to
PCRE, replacing invalid sequences with special characters which could
then be reintroduced in the matches in the corresponding positions.
For now I guess we should require strings to be valid, and if somebody
is able to implement this later we can always remove this requirement,
as it wouldn't be breaking.
Thanks for your help
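
The placeholder idea above could be sketched along these lines: replace each invalid span with U+FFFD, keep a table from sanitized byte offsets back to original offsets, and use it to map a match back onto the original bytes. This is only an illustration in Python; the function names are hypothetical, a bytes-mode `re` search stands in for PCRE2, and the repeated slicing makes it quadratic in the worst case.

```python
import re

REPLACEMENT = "\ufffd".encode("utf-8")   # 3 bytes: EF BF BD

def sanitize(raw: bytes):
    """Return (clean, origin): valid UTF-8 plus an offset table.

    origin[i] is the original offset of sanitized byte i; a final entry
    maps the end of the sanitized string to the end of the original.
    """
    out = bytearray()
    origin = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            bad_start = bad_end = len(raw)      # the rest is valid
        except UnicodeDecodeError as err:
            bad_start, bad_end = pos + err.start, pos + err.end
        out += raw[pos:bad_start]               # copy the valid run
        origin.extend(range(pos, bad_start))
        if bad_start < len(raw):                # replace the invalid span
            out += REPLACEMENT
            origin.extend([bad_start] * len(REPLACEMENT))
        pos = bad_end
    origin.append(len(raw))
    return bytes(out), origin

def original_span(origin, start: int, end: int):
    """Map a match span in the sanitized string back to the original bytes."""
    return origin[start], origin[end]

raw = b"ab\xffde"
clean, origin = sanitize(raw)
m = re.search(rb"b.*d", clean)                  # stand-in for the real matcher
lo, hi = original_span(origin, *m.span())
```

Here `raw[lo:hi]` recovers the original bytes of the match, invalid byte included. A real U+FFFD in the input is indistinguishable from a placeholder during matching, which is one of the trade-offs of this approach.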
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 


Re: [pcre-dev] Support for invalid UTF-8 strings?

2018-04-13 Thread ph10
On Thu, 12 Apr 2018, Milan Bouchet-Valat wrote:

> I'm writing on behalf of the Julia programming language [1] developers
> in order to get some information regarding the handling of invalid
> UTF-8 strings when PCRE2_UTF and PCRE2_NO_UTF_CHECK flags are set.

Milan,

I understand what you are suggesting (treating invalid UTF-8 as one-byte 
characters) because I have implemented exactly that in other software 
I've written where performance is not critical.

However, in regex matching, performance *is* critical, which is why PCRE 
insists on working only with valid UTF strings. Checking each sequence 
for validity each time a character was inspected would degrade 
performance. (Also, in a backtracking algorithm, the same character may
be inspected multiple times during the course of a match, which only 
makes matters worse.)

The code in the PCRE2 library that checks a UTF-8 string for validity is
non-trivial. (It's in the source file src/pcre2_valid_utf.c if you want
to take a look.) Admittedly, it does identify very specific errors in
invalid sequences, but, for example, checking a 3-byte sequence involves
seven "if" tests of various kinds plus a switch and a table lookup.
(That's from a quick visual scan of the code; hope I counted right.)
Ignoring some of the less serious errors (overlong sequences or
surrogate codes) would simplify this a bit, but not much.
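
To give a feel for the kinds of checks involved, here is a simplified sketch of validating one 3-byte sequence. It is written in Python for illustration and is not the code in src/pcre2_valid_utf.c; the function name and error labels are made up.

```python
def check_three_byte(seq: bytes) -> str:
    """Classify a 3-byte UTF-8 candidate; returns "ok" or an error label."""
    if len(seq) < 3:
        return "truncated"
    b0, b1, b2 = seq[0], seq[1], seq[2]
    if (b0 & 0xF0) != 0xE0:
        return "not a 3-byte lead"
    if (b1 & 0xC0) != 0x80:
        return "bad second byte"
    if (b2 & 0xC0) != 0x80:
        return "bad third byte"
    # Assemble the code point from the payload bits.
    cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if cp < 0x800:
        return "overlong"        # would have fit in 1 or 2 bytes
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"       # not allowed in UTF-8
    return "ok"
```

Dropping the last two tests (overlong and surrogate) removes only two of the six branches, which is consistent with the remark that ignoring the less serious errors would not simplify things much.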

My view on this has always been that the most efficient approach, in the 
sense of getting the "best" (in some sense) behaviour over all
applications, is for applications to handle non-standard character
strings external to PCRE so that it can work as efficiently as possible.
One possible approach for strings of unknown provenance is to run
without PCRE2_NO_UTF_CHECK and, if any of the "invalid UTF" errors
occur, to convert the string (according to whatever rules you want) into
a valid UTF-8 string and then try again.
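
The check-then-repair-and-retry flow might look like this in outline. The sketch uses Python, with strict decoding standing in for PCRE2's UTF check and `re` standing in for the matcher; the function name and the choice of U+FFFD replacement as the repair rule are assumptions.

```python
import re

def search_with_fallback(pattern: str, raw: bytes):
    """Match against raw; repair the subject only if validation fails.

    Returns (match, repaired) so the caller knows which path was taken.
    """
    try:
        text = raw.decode("utf-8")                    # fast path: strict check
        repaired = False
    except UnicodeDecodeError:
        # Slow path: convert to valid text by some rule, then try again.
        text = raw.decode("utf-8", errors="replace")  # invalid spans -> U+FFFD
        repaired = True
    return re.search(pattern, text), repaired
```

Valid subjects pay only for the one validation pass; the conversion cost falls entirely on the invalid ones.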

> Do you think such a behavior would make sense? Could it be implemented
> without dramatically impacting performance? Julia could use a custom
> patch if this feature is not deemed useful for PCRE.

It certainly makes sense, but I don't think it could be implemented 
without a serious performance hit. If you want to hack and try, note 
that the macros whose names start with GETCHAR (in pcre2_intmodedep.h) 
are used for character handling. In the case of UTF-8 these make use of 
GETUTF8, GETUTF8INC, and GETUTF8LEN, which are defined in 
pcre2_internal.h. However, there are also BACKCHAR, FORWARDCHAR, and 
ACROSSCHAR for moving around. These macros are used for compilation as 
well as for matching by the interpreter functions pcre2_match() and 
pcre2_dfa_match(). I don't know what happens in the JIT matcher, as I do 
not maintain that code, but it too assumes valid UTF-8. To be honest, I 
don't really advise trying to hack in this way. I think it makes more 
sense to fix bad strings externally.
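
As an illustration of why moving around depends on validity, here is a sketch in the spirit of backward and forward character stepping. It is written in Python and is not the definition of the BACKCHAR or FORWARDCHAR macros; the function names are hypothetical.

```python
def is_continuation(b: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (b & 0xC0) == 0x80

def back_char(subject: bytes, pos: int) -> int:
    """Step back to the start of the previous character (valid UTF-8 assumed)."""
    pos -= 1
    while pos > 0 and is_continuation(subject[pos]):
        pos -= 1
    return pos

def forward_char(subject: bytes, pos: int) -> int:
    """Step forward to the start of the next character (valid UTF-8 assumed)."""
    pos += 1
    while pos < len(subject) and is_continuation(subject[pos]):
        pos += 1
    return pos

subject = "aéz".encode("utf-8")    # b"a\xc3\xa9z"
```

Starting at offset 2 of `subject` (the middle of é), one step forward lands on offset 3 and one step back then lands on offset 1, not 2. Stepping is only reversible when every position is a character boundary, which is what valid UTF-8 guarantees.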

Philip

-- 
Philip Hazel
