[perl.git] branch blead, updated. v5.25.4-152-gd566bd2

Karl Williamson Sat, 17 Sep 2016 16:23:45 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/d566bd20c27a46aecd668d2f739b9515f46ac74f?hp=dd819584009e7adfc0786ed5beaf6c805ef05a2d>


- Log -----------------------------------------------------------------
commit d566bd20c27a46aecd668d2f739b9515f46ac74f
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:22:01 2016 -0600

    APItest/t/utf8.t: Add tests
    
    These fill in gaps in current testing.  In particular all the overlong
    UTF-8 possible edge cases are now tested.

M       ext/XS-APItest/t/utf8.t

commit 9d2d0ecdeef6b78a8c765be081a02ac8835290c8
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:14:38 2016 -0600

    APItest/utf8.t: Some clean up
    
    This adds some information to test names, does some white-space
    alignments, changes one test to stress things slightly more, and adds a
    'use bytes' because in some cases the desired byte-oriented output was
    not showing up.

M       ext/XS-APItest/t/utf8.t

commit d78742984f4f0bd6bde1ad6c7f276904d3461805
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 4 21:32:08 2016 -0600

    Test isUTF8_CHAR()

M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t
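
The new tests exercise one contract: given a start pointer and an end pointer, isUTF8_CHAR() yields the byte length of a well-formed character, and 0 when the sequence is malformed or the buffer ends too early. The sketch below illustrates that contract only. It is not Perl's macro: the name sketch_utf8_char_len is hypothetical, it assumes an ASCII platform, and it handles just 1- to 4-byte sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Perl's isUTF8_CHAR(): returns the byte length of
 * the well-formed UTF-8 character that starts at s, or 0 if the bytes in
 * [s, e) do not begin with one.  ASCII platforms, 1- to 4-byte forms only. */
static size_t
sketch_utf8_char_len(const unsigned char *s, const unsigned char *e)
{
    size_t need, i;

    assert(s != NULL && e != NULL);
    if (s >= e) return 0;

    if (s[0] < 0x80) return 1;                          /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) need = 2;    /* C0, C1: overlong */
    else if ((s[0] & 0xF0) == 0xE0) need = 3;
    else if (s[0] >= 0xF0 && s[0] <= 0xF7) need = 4;
    else return 0;      /* continuation byte, or a form not handled here */

    if ((size_t)(e - s) < need) return 0;               /* buffer too short */

    for (i = 1; i < need; i++)                          /* 10xxxxxx bytes */
        if ((s[i] & 0xC0) != 0x80) return 0;

    if (s[0] == 0xE0 && s[1] < 0xA0) return 0;          /* overlong 3-byte */
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;          /* overlong 4-byte */

    return need;
}
```

As in the tests, passing a length one byte short of the character yields 0, and so does a start byte that is immediately followed by another start byte.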

commit ca93ce3c78e1c94aac14b4ddb00735763f25c1bc
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:19:42 2016 -0600

    lib/warnings/utf8: Reinstate warning test
    
    I removed this in 35f8c9bd0ff4f298f8bc09ae9848a14a9667a95a, thinking the
    warning was no longer being raised.  But in fact, it was showing a bug,
    now fixed by the previous commit.

M       t/lib/warnings/utf8

commit af13dd8ae148d022e85f4fdcf737e07416145e28
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:15:04 2016 -0600

    Revamp overlong handling in is_utf8_char_slow, fixing a bug
    
    This combines EBCDIC and ASCII branches as much as possible, and fixes a
    bug that showed up only on EBCDIC platforms, and 64-bit ASCII ones for
    the highest overlong, where it could erroneously conclude that a
    sequence was an overlong.
    
    Tests are coming in a future commit.

M       utf8.c
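
For readers unfamiliar with overlongs, here is a rough sketch of how they can be recognized from the first two bytes on an ASCII platform. The thresholds match the "highest N-byte" overlong cases in the utf8.t diff in this message. This is not the code in utf8.c: the function name is hypothetical, and it deliberately leaves out the 13-byte form starting with 0xFF and all of EBCDIC, which is where the bug described above lived.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the first two bytes at s show that the
 * sequence encodes a value that would fit in fewer bytes.  ASCII platforms
 * only; start bytes up to 0xFE (the 7-byte form). */
static bool
sketch_is_overlong(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    if (len == 0) return false;

    /* Every 2-byte sequence starting with C0 or C1 encodes a value < 0x80 */
    if (s[0] == 0xC0 || s[0] == 0xC1) return true;

    if (len < 2 || (s[1] & 0xC0) != 0x80) return false;

    switch (s[0]) {
        case 0xE0: return s[1] < 0xA0;   /* 3 bytes, value < 0x800       */
        case 0xF0: return s[1] < 0x90;   /* 4 bytes, value < 0x10000     */
        case 0xF8: return s[1] < 0x88;   /* 5 bytes, value < 0x200000    */
        case 0xFC: return s[1] < 0x84;   /* 6 bytes, value < 0x4000000   */
        case 0xFE: return s[1] < 0x82;   /* 7 bytes, value < 0x80000000  */
        default:   return false;
    }
}
```

Only the start byte and the first continuation byte matter for these forms, because the remaining continuation bytes cannot raise the value past the threshold once the leading payload bits are all zero.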

commit 3a498dae9d7c71a6ee50bac25906ff51f22b86ab
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:06:39 2016 -0600

    utf8.c: Fix typo in comment, add some comments

M       utf8.c

commit 83dc0f42cb8bfe955e45a5b44b989daddf87570a
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 09:00:03 2016 -0600

    utf8.c: Extract duplicate code to common fcn
    
    Actually the two copies aren't quite identical, but they should be,
    because one instance is wrong.  The failure would only show up on
    EBCDIC platforms.  Tests are coming in a future commit.

M       embed.fnc
M       embed.h
M       ext/XS-APItest/t/utf8.t
M       proto.h
M       utf8.c

commit 062b685031576af47e0f0097d66e7274cccc443f
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 08:54:36 2016 -0600

    handy.h: Add memLT, memLE, memGT, memGE
    
    These correspond to strLT, etc.  I am deferring documenting them in case
    this turns out to be a bad idea for some reason.

M       handy.h
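
A small usage sketch for the new macros. The four definitions are copied from the handy.h hunk in this message; the range helper built on top of them is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copied from the handy.h hunk in this commit */
#define memLT(s1,s2,l) (memcmp(s1,s2,l) < 0)
#define memLE(s1,s2,l) (memcmp(s1,s2,l) <= 0)
#define memGT(s1,s2,l) (memcmp(s1,s2,l) > 0)
#define memGE(s1,s2,l) (memcmp(s1,s2,l) >= 0)

/* Hypothetical helper: non-zero if the l-byte key lies within [lo, hi]
 * when the three buffers are compared bytewise. */
static int
sketch_in_byte_range(const void *key, const void *lo, const void *hi,
                     size_t l)
{
    assert(key != NULL && lo != NULL && hi != NULL);
    return memGE(key, lo, l) && memLE(key, hi, l);
}
```

Unlike strLT and friends, these take an explicit length, so they work on buffers that contain NUL bytes.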

commit a791584fcdb7d580494680fbbcca3e3f880b78bf
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 08:46:18 2016 -0600

    Unconditionally define memcmp() if not sane
    
    Prior to this commit, if memcmp was already #defined to a version
    that Configure deemed insufficient for normal use, that definition
    was retained, so perl used the defective version.  This apparently
    hasn't been a problem in the field, but I noticed the potential
    issue while reading the code, and am correcting it.

M       perl.h
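
The difference between the old "#ifndef" guard and the new "#undef" can be sketched as follows. Everything here is illustrative: sketch_memcmp stands in for perl's my_memcmp, and the pre-existing mapping to hypothetical_defective_memcmp is invented to play the part of an unwanted definition from some header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for my_memcmp: compares bytes as unsigned values */
static int
sketch_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a;
    const unsigned char *q = b;

    assert(n == 0 || (a != NULL && b != NULL));
    for (; n > 0; n--, p++, q++)
        if (*p != *q) return (int) *p - (int) *q;
    return 0;
}

/* Pretend an earlier header had already mapped memcmp somewhere unsuitable */
#define memcmp(a,b,n) hypothetical_defective_memcmp(a,b,n)

/* Old form: "#ifndef memcmp" would see the mapping above and keep it.
 * New form: discard whatever is there, then install the replacement. */
#undef memcmp
#define memcmp sketch_memcmp
```

After the last two lines, every later call to memcmp() in the translation unit reaches the replacement, regardless of what was defined before.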

commit 784d4f31222f1bf7421b1aab87276f4878d60363
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 14:12:27 2016 -0600

    isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII
    
    This changes the macro isUTF8_CHAR to have the same number of code
    points built-in for EBCDIC as ASCII.  This obsoletes the
    IS_UTF8_CHAR_FAST macro, which is removed.
    
    Previously, the code generated by regen/regcharclass.pl for ASCII
    platforms was hand copied into utf8.h, and LIKELY's manually added, then
    the generating code was commented out.  Now this has been done with
    EBCDIC platforms as well.  This makes regenerating regcharclass.h
    faster.
    
    The copied macro in utf8.h is moved by this commit to within the main
    code section for non-EBCDIC compiles, cutting the number of #ifdef's
    down, and the comments about it are changed somewhat.

M       regcharclass.h
M       regen/regcharclass.pl
M       utf8.h
M       utfebcdic.h

commit 21cb232c014d719d883ed0f8d6185dc36037859e
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 12:15:29 2016 -0600

    regen/regcharclass.pl: surrogates are code points
    
    They are not "characters"

M       regcharclass.h
M       regen/regcharclass.pl

commit c2b327983e89375d27cb0e1b21f0bd96e7fdd1ce
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 16:13:15 2016 -0600

    Add IS_UTF8_INVARIANT and IS_UVCHR_INVARIANT to API

M       utf8.h

commit 1072f3e3675b2d747002e0ee6adbf9c22e344552
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:03:21 2016 -0600

    utfebcdic.h: Fix typo in comment

M       utfebcdic.h

commit ecc1615fad60409fb5f52f04c821b95dcf54f48d
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 16:05:35 2016 -0600

    Add #defines for XS code for Unicode Corrigendum 9
    
    These are convenience macros.

M       utf8.c
M       utf8.h

commit 111fa70000f00b5976674ebcee575247e12e0926
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 16:02:50 2016 -0600

    perlapi: Clarify utf8n_to_uvchr entry

M       utf8.c

commit f66ccb6c49140b3167e4e605ce87d137725df9e7
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 15:57:34 2016 -0600

    perlunicode: Fix typo

M       pod/perlunicode.pod

commit a09ec51a7fc3455257a1239b61ef7b53a6b0570d
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Sep 13 16:40:44 2016 -0600

    append_utf8_from_native_byte: Add parens for clarity
    
    I can never remember the precedence of dereference and ++.

M       inline.h
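
For reference, postfix ++ binds more tightly than unary *, so the old and new spellings mean the same thing: store through the current pointer, then advance it. A minimal sketch, with a hypothetical function name and a local typedef standing in for perl's U8:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char U8;   /* stand-in for perl's U8 */

/* Hypothetical helper: stores one byte at *dest, then advances *dest */
static void
sketch_append_byte(const U8 byte, U8 **dest)
{
    assert(dest != NULL && *dest != NULL);
    *((*dest)++) = byte;    /* parses identically to  *(*dest)++ = byte  */
}
```

The inner parentheses around *dest are the ones that matter: without them, *dest++ would advance the U8** itself rather than the pointer it refers to.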
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc                 |   1 +
 embed.h                   |   1 +
 ext/XS-APItest/APItest.xs |   7 +
 ext/XS-APItest/t/utf8.t   | 355 +++++++++++++++++++++++++++++++++++++++++-----
 handy.h                   |   5 +
 inline.h                  |   6 +-
 perl.h                    |   5 +-
 pod/perlunicode.pod       |   2 +-
 proto.h                   |   6 +
 regcharclass.h            |  30 +---
 regen/regcharclass.pl     |  32 ++---
 t/lib/warnings/utf8       |   1 +
 utf8.c                    | 290 ++++++++++++++++++++++++-------------
 utf8.h                    | 149 ++++++++++---------
 utfebcdic.h               |  53 ++++++-
 15 files changed, 689 insertions(+), 254 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index c547b56..43ed918 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -708,6 +708,7 @@ ADMpPR      |bool   |isIDFIRST_lazy |NN const char* p
 ADMpPR |bool   |isALNUM_lazy   |NN const char* p
 #ifdef PERL_IN_UTF8_C
 snR    |U8     |to_lower_latin1|const U8 c|NULLOK U8 *p|NULLOK STRLEN *lenp
+inPR   |bool   |is_utf8_cp_above_31_bits|NN const U8 * const s|NN const U8 * const e
 #endif
 #if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C)
 EXp    |UV        |_to_fold_latin1|const U8 c|NN U8 *p|NN STRLEN *lenp|const unsigned int flags
diff --git a/embed.h b/embed.h
index 8be5109..6a15a97 100644
--- a/embed.h
+++ b/embed.h
@@ -1817,6 +1817,7 @@
 #define _to_utf8_case(a,b,c,d,e,f,g)   S__to_utf8_case(aTHX_ a,b,c,d,e,f,g)
 #define check_locale_boundary_crossing(a,b,c,d)        S_check_locale_boundary_crossing(aTHX_ a,b,c,d)
 #define is_utf8_common(a,b,c,d)        S_is_utf8_common(aTHX_ a,b,c,d)
+#define is_utf8_cp_above_31_bits       S_is_utf8_cp_above_31_bits
 #define swash_scan_list_line(a,b,c,d,e,f,g)    S_swash_scan_list_line(aTHX_ a,b,c,d,e,f,g)
 #define swatch_get(a,b,c)      S_swatch_get(aTHX_ a,b,c)
 #define to_lower_latin1                S_to_lower_latin1
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index 907c17d..d2c1c33 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -5320,6 +5320,13 @@ test_isUTF8_POSSIBLY_PROBLEMATIC(char ch)
     OUTPUT:
         RETVAL
 
+STRLEN
+test_isUTF8_CHAR(char *s, STRLEN len)
+    CODE:
+        RETVAL = isUTF8_CHAR((U8 *) s, (U8 *) s + len);
+    OUTPUT:
+        RETVAL
+
 UV
 test_toLOWER(UV ord)
     CODE:
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index 9b5ed9b..735feba 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -14,6 +14,7 @@ my $pound_sign = chr utf8::unicode_to_native(163);
 sub isASCII { ord "A" == 65 }
 
 sub display_bytes {
+    use bytes;
     my $string = shift;
     return   '"'
            . join("", map { sprintf("\\x%02x", ord $_) } split "", $string)
@@ -215,10 +216,30 @@ my %code_points = (
 
 if ($is64bit) {
     no warnings qw(overflow portable);
-    $code_points{0x100000000}        = (isASCII) ? "\xfe\x84\x80\x80\x80\x80\x80" : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0");
-    $code_points{0x1000000000 - 1}   = (isASCII) ? "\xfe\xbf\xbf\xbf\xbf\xbf\xbf" : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa1\xbf\xbf\xbf\xbf\xbf\xbf\xbf");
-    $code_points{0x1000000000}       = (isASCII) ? "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80" : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0\xa0");
-    $code_points{0xFFFFFFFFFFFFFFFF} = (isASCII) ? "\xff\x80\x8f\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf" : I8_to_native("\xff\xaf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf");
+    $code_points{0x100000000}        = (isASCII)
+                                        ?              "\xfe\x84\x80\x80\x80\x80\x80"
+                                        : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0");
+    $code_points{0x1000000000 - 1}   = (isASCII)
+                                        ?              "\xfe\xbf\xbf\xbf\xbf\xbf\xbf"
+                                        : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa1\xbf\xbf\xbf\xbf\xbf\xbf\xbf");
+    $code_points{0x1000000000}       = (isASCII)
+                                        ?              "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80"
+                                        : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0\xa0");
+    $code_points{0xFFFFFFFFFFFFFFFF} = (isASCII)
+                                        ?              "\xff\x80\x8f\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
+                                        : I8_to_native("\xff\xaf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf");
+    if (isASCII) {  # These could falsely show as overlongs in a naive implementation
+        $code_points{0x40000000000}  = "\xff\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80\x80";
+        $code_points{0x1000000000000} = "\xff\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80\x80\x80";
+        $code_points{0x40000000000000} = "\xff\x80\x80\x81\x80\x80\x80\x80\x80\x80\x80\x80\x80";
+        $code_points{0x1000000000000000} = "\xff\x80\x81\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80";
+        # overflows
+        #$code_points{0xfoo}     = "\xff\x81\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80";
+    }
+}
+elsif (! isASCII) { # 32-bit EBCDIC.  64-bit is clearer to handle, so doesn't need this test case
+    no warnings qw(overflow portable);
+    $code_points{0x40000000} = I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0");
 }
 
 # Now add in entries for each of code points 0-255, which require special
@@ -391,10 +412,10 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
     undef @warnings;
 
     my $display_flags = sprintf "0x%x", $this_utf8_flags;
-    my $ret_ref = test_utf8n_to_uvchr($bytes, $len, $this_utf8_flags);
     my $display_bytes = display_bytes($bytes);
+    my $ret_ref = test_utf8n_to_uvchr($bytes, $len, $this_utf8_flags);
     is($ret_ref->[0], $n, "Verify utf8n_to_uvchr($display_bytes, $display_flags) returns $hex_n");
-    is($ret_ref->[1], $len, "Verify utf8n_to_uvchr() for $hex_n returns expected length");
+    is($ret_ref->[1], $len, "Verify utf8n_to_uvchr() for $hex_n returns expected length: $len");
 
     unless (is(scalar @warnings, 0,
                "Verify utf8n_to_uvchr() for $hex_n generated no warnings"))
@@ -404,9 +425,31 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
 
     undef @warnings;
 
+    my $ret = test_isUTF8_CHAR($bytes, $len);
+    is($ret, $len, "Verify isUTF8_CHAR($display_bytes) returns expected length: $len");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isUTF8_CHAR() for $hex_n generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
+    $ret = test_isUTF8_CHAR($bytes, $len - 1);
+    is($ret, 0, "Verify isUTF8_CHAR() with too short length parameter returns 0");
+
+    unless (is(scalar @warnings, 0,
+               "Verify isUTF8_CHAR() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+    undef @warnings;
+
     $ret_ref = test_valid_utf8_to_uvchr($bytes);
     is($ret_ref->[0], $n, "Verify valid_utf8_to_uvchr($display_bytes) returns $hex_n");
-    is($ret_ref->[1], $len, "Verify valid_utf8_to_uvchr() for $hex_n returns expected length");
+    is($ret_ref->[1], $len, "Verify valid_utf8_to_uvchr() for $hex_n returns expected length: $len");
 
     unless (is(scalar @warnings, 0,
                "Verify valid_utf8_to_uvchr() for $hex_n generated no warnings"))
@@ -430,7 +473,7 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
 
     undef @warnings;
 
-    my $ret = test_uvchr_to_utf8_flags($n, $this_uvchr_flags);
+    $ret = test_uvchr_to_utf8_flags($n, $this_uvchr_flags);
     ok(defined $ret, "Verify uvchr_to_utf8_flags($hex_n, $display_flags) returned success");
     is($ret, $bytes, "Verify uvchr_to_utf8_flags($hex_n, $display_flags) returns correct bytes");
 
@@ -456,8 +499,8 @@ my @malformations = (
         qr/unexpected continuation byte/
     ],
     [ "premature next character malformation (immediate)",
-        (isASCII) ? "\xc2a" : I8_to_native("\xc5") ."a",
-        2,
+        (isASCII) ? "\xc2\xc2\x80" : I8_to_native("\xc5\xc5\xa0"),
+        3,
         $UTF8_ALLOW_NON_CONTINUATION, $REPLACEMENT, 1,
         qr/unexpected non-continuation byte.*immediately after start byte/
     ],
@@ -473,35 +516,211 @@ my @malformations = (
         $UTF8_ALLOW_SHORT, $REPLACEMENT, 2,
         qr/2 bytes, need 4/
     ],
-    [ "overlong malformation", I8_to_native("\xc0$c"), 2,
+    [ "overlong malformation, lowest 2-byte",
+        (isASCII) ? "\xc0\x80" : I8_to_native("\xc0\xa0"),
+        2,
         $UTF8_ALLOW_LONG,
         0,   # NUL
         2,
         qr/2 bytes, need 1/
     ],
-    [ "overflow malformation",
-                    # These are the smallest overflowing on 64 byte machines:
-                    # 2**64
-        (isASCII) ? "\xff\x80\x90\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"
-                  : I8_to_native("\xff\xB0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
-        (isASCII) ? 13 : 14,
-        0,  # There is no way to allow this malformation
-        $REPLACEMENT,
-        (isASCII) ? 13 : 14,
-        qr/overflow/
+    [ "overlong malformation, highest 2-byte",
+        (isASCII) ? "\xc1\xbf" : I8_to_native("\xc4\xbf"),
+        2,
+        $UTF8_ALLOW_LONG,
+        (isASCII) ? 0x7F : utf8::unicode_to_native(0xBF),
+        2,
+        qr/2 bytes, need 1/
+    ],
+    [ "overlong malformation, lowest 3-byte",
+        (isASCII) ? "\xe0\x80\x80" : I8_to_native("\xe0\xa0\xa0"),
+        3,
+        $UTF8_ALLOW_LONG,
+        0,   # NUL
+        3,
+        qr/3 bytes, need 1/
+    ],
+    [ "overlong malformation, highest 3-byte",
+        (isASCII) ? "\xe0\x9f\xbf" : I8_to_native("\xe0\xbf\xbf"),
+        3,
+        $UTF8_ALLOW_LONG,
+        (isASCII) ? 0x7FF : 0x3FF,
+        3,
+        qr/3 bytes, need 2/
+    ],
+    [ "overlong malformation, lowest 4-byte",
+        (isASCII) ? "\xf0\x80\x80\x80" : I8_to_native("\xf0\xa0\xa0\xa0"),
+        4,
+        $UTF8_ALLOW_LONG,
+        0,   # NUL
+        4,
+        qr/4 bytes, need 1/
+    ],
+    [ "overlong malformation, highest 4-byte",
+        (isASCII) ? "\xf0\x8F\xbf\xbf" : I8_to_native("\xf0\xaf\xbf\xbf"),
+        4,
+        $UTF8_ALLOW_LONG,
+        (isASCII) ? 0xFFFF : 0x3FFF,
+        4,
+        qr/4 bytes, need 3/
+    ],
+    [ "overlong malformation, lowest 5-byte",
+        (isASCII)
+         ?              "\xf8\x80\x80\x80\x80"
+         : I8_to_native("\xf8\xa0\xa0\xa0\xa0"),
+        5,
+        $UTF8_ALLOW_LONG,
+        0,   # NUL
+        5,
+        qr/5 bytes, need 1/
+    ],
+    [ "overlong malformation, highest 5-byte",
+        (isASCII)
+         ?              "\xf8\x87\xbf\xbf\xbf"
+         : I8_to_native("\xf8\xa7\xbf\xbf\xbf"),
+        5,
+        $UTF8_ALLOW_LONG,
+        (isASCII) ? 0x1FFFFF : 0x3FFFF,
+        5,
+        qr/5 bytes, need 4/
+    ],
+    [ "overlong malformation, lowest 6-byte",
+        (isASCII)
+         ?              "\xfc\x80\x80\x80\x80\x80"
+         : I8_to_native("\xfc\xa0\xa0\xa0\xa0\xa0"),
+        6,
+        $UTF8_ALLOW_LONG,
+        0,   # NUL
+        6,
+        qr/6 bytes, need 1/
+    ],
+    [ "overlong malformation, highest 6-byte",
+        (isASCII)
+         ?              "\xfc\x83\xbf\xbf\xbf\xbf"
+         : I8_to_native("\xfc\xa3\xbf\xbf\xbf\xbf"),
+        6,
+        $UTF8_ALLOW_LONG,
+        (isASCII) ? 0x3FFFFFF : 0x3FFFFF,
+        6,
+        qr/6 bytes, need 5/
+    ],
+    [ "overlong malformation, lowest 7-byte",
+        (isASCII)
+         ?              "\xfe\x80\x80\x80\x80\x80\x80"
+         : I8_to_native("\xfe\xa0\xa0\xa0\xa0\xa0\xa0"),
+        7,
+        $UTF8_ALLOW_LONG,
+        0,   # NUL
+        7,
+        qr/7 bytes, need 1/
+    ],
+    [ "overlong malformation, highest 7-byte",
+        (isASCII)
+         ?              "\xfe\x81\xbf\xbf\xbf\xbf\xbf"
+         : I8_to_native("\xfe\xa1\xbf\xbf\xbf\xbf\xbf"),
+        7,
+        $UTF8_ALLOW_LONG,
+        (isASCII) ? 0x7FFFFFFF : 0x3FFFFFF,
+        7,
+        qr/7 bytes, need 6/
     ],
 );
 
+if (isASCII && ! $is64bit) {    # 32-bit ASCII platform
+    no warnings 'portable';
+    push @malformations,
+        [ "overflow malformation",
+            "\xfe\x84\x80\x80\x80\x80\x80",  # Represents 2**32
+            7,
+            0,  # There is no way to allow this malformation
+            $REPLACEMENT,
+            7,
+            qr/overflow/
+        ],
+        [ "overflow malformation, can tell on first byte",
+            "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80",
+            13,
+            0,  # There is no way to allow this malformation
+            $REPLACEMENT,
+            13,
+            qr/overflow/
+        ];
+}
+else {
+    # On EBCDIC platforms, another overlong test is needed even on 32-bit
+    # systems, whereas it doesn't happen on ASCII except on 64-bit ones.
+
+    no warnings 'portable';
+    no warnings 'overflow'; # Doesn't run on 32-bit systems, but compiles
+    push @malformations,
+        [ "overlong malformation, lowest max-byte",
+            (isASCII)
+             ?              "\xff\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80\x80"
+             : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+            (isASCII) ? 13 : 14,
+            $UTF8_ALLOW_LONG,
+            0,   # NUL
+            (isASCII) ? 13 : 14,
+            qr/1[34] bytes, need 1/,    # 1[34] to work on either ASCII or EBCDIC
+        ],
+        [ "overlong malformation, highest max-byte",
+            (isASCII)    # 2**36-1 on ASCII; 2**30-1 on EBCDIC
+             ?              "\xff\x80\x80\x80\x80\x80\x80\xbf\xbf\xbf\xbf\xbf\xbf"
+             : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xbf\xbf\xbf\xbf\xbf\xbf"),
+            (isASCII) ? 13 : 14,
+            $UTF8_ALLOW_LONG,
+            (isASCII) ? 0xFFFFFFFFF : 0x3FFFFFFF,
+            (isASCII) ? 13 : 14,
+            qr/1[34] bytes, need 7/,
+        ];
+
+    if (! $is64bit) {   # 32-bit EBCDIC
+        push @malformations,
+        [ "overflow malformation",
+            I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0"),
+            14,
+            0,  # There is no way to allow this malformation
+            $REPLACEMENT,
+            14,
+            qr/overflow/
+        ];
+    }
+    else {  # 64-bit
+        push @malformations,
+            [ "overflow malformation",
+               (isASCII)
+                ?              "\xff\x80\x90\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"
+                : I8_to_native("\xff\xb0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                (isASCII) ? 13 : 14,
+                0,  # There is no way to allow this malformation
+                $REPLACEMENT,
+                (isASCII) ? 13 : 14,
+                qr/overflow/
+            ];
+    }
+}
+
 foreach my $test (@malformations) {
     my ($testname, $bytes, $length, $allow_flags, $allowed_uv, $expected_len, $message ) = @$test;
 
     next if ! ok(length($bytes) >= $length, "$testname: Make sure won't read beyond buffer: " . length($bytes) . " >= $length");
 
+    undef @warnings;
+
+    my $ret = test_isUTF8_CHAR($bytes, $length);
+    is($ret, 0, "$testname: isUTF8_CHAR returns 0");
+    unless (is(scalar @warnings, 0,
+               "$testname: isUTF8_CHAR() generated no warnings"))
+    {
+        diag "The warnings were: " . join(", ", @warnings);
+    }
+
+
     # Test what happens when this malformation is not allowed
     undef @warnings;
     my $ret_ref = test_utf8n_to_uvchr($bytes, $length, 0);
     is($ret_ref->[0], 0, "$testname: disallowed: Returns 0");
-    is($ret_ref->[1], $expected_len, "$testname: disallowed: Returns expected length");
+    is($ret_ref->[1], $expected_len, "$testname: utf8n_to_uvchr(), disallowed: Returns expected length: $expected_len");
     if (is(scalar @warnings, 1, "$testname: disallowed: Got a single warning ")) {
         like($warnings[0], $message, "$testname: disallowed: Got expected warning");
     }
@@ -515,9 +734,9 @@ foreach my $test (@malformations) {
         undef @warnings;
         no warnings 'utf8';
         my $ret_ref = test_utf8n_to_uvchr($bytes, $length, 0);
-        is($ret_ref->[0], 0, "$testname: disallowed: no warnings 'utf8': Returns 0");
-        is($ret_ref->[1], $expected_len, "$testname: disallowed: no warnings 'utf8': Returns expected length");
-        if (!is(scalar @warnings, 0, "$testname: disallowed: no warnings 'utf8': no warnings generated")) {
+        is($ret_ref->[0], 0, "$testname: utf8n_to_uvchr(), disallowed: no warnings 'utf8': Returns 0");
+        is($ret_ref->[1], $expected_len, "$testname: utf8n_to_uvchr(), disallowed: no warnings 'utf8': Returns expected length: $expected_len");
+        if (!is(scalar @warnings, 0, "$testname: utf8n_to_uvchr(), disallowed: no warnings 'utf8': no warnings generated")) {
             diag "The warnings were: " . join(", ", @warnings);
         }
     }
@@ -526,7 +745,7 @@ foreach my $test (@malformations) {
     undef @warnings;
     $ret_ref = test_utf8n_to_uvchr($bytes, $length, $UTF8_CHECK_ONLY);
     is($ret_ref->[0], 0, "$testname: CHECK_ONLY: Returns 0");
-    is($ret_ref->[1], -1, "$testname: CHECK_ONLY: returns expected length");
+    is($ret_ref->[1], -1, "$testname: CHECK_ONLY: returns -1 for length");
     if (! is(scalar @warnings, 0, "$testname: CHECK_ONLY: no warnings generated")) {
         diag "The warnings were: " . join(", ", @warnings);
     }
@@ -536,9 +755,9 @@ foreach my $test (@malformations) {
     # Test when the malformation is allowed
     undef @warnings;
     $ret_ref = test_utf8n_to_uvchr($bytes, $length, $allow_flags);
-    is($ret_ref->[0], $allowed_uv, "$testname: allowed: Returns expected uv");
-    is($ret_ref->[1], $expected_len, "$testname: allowed: Returns expected length");
-    if (!is(scalar @warnings, 0, "$testname: allowed: no warnings generated"))
+    is($ret_ref->[0], $allowed_uv, "$testname: utf8n_to_uvchr(), allowed: Returns expected uv: " . sprintf("0x%04X", $allowed_uv));
+    is($ret_ref->[1], $expected_len, "$testname: utf8n_to_uvchr(), allowed: Returns expected length: $expected_len");
+    if (!is(scalar @warnings, 0, "$testname: utf8n_to_uvchr(), allowed: no warnings generated"))
     {
         diag "The warnings were: " . join(", ", @warnings);
     }
@@ -575,6 +794,14 @@ my @tests = (
         (isASCII) ? 4 : 5,
         qr/not Unicode.* may not be portable/
     ],
+    [ "non_unicode whose first byte tells that",
+        (isASCII) ? "\xf5\x80\x80\x80" : I8_to_native("\xfa\xa0\xa0\xa0\xa0"),
+        $UTF8_WARN_SUPER, $UTF8_DISALLOW_SUPER,
+        'non_unicode',
+        (isASCII) ? 0x140000 : 0x200000,
+        (isASCII) ? 4 : 5,
+        qr/not Unicode.* may not be portable/
+    ],
     [ "first of 32 consecutive non-character code points",
         (isASCII) ? "\xef\xb7\x90" : I8_to_native("\xf1\xbf\xae\xb0"),
         $UTF8_WARN_NONCHAR, $UTF8_DISALLOW_NONCHAR,
@@ -857,10 +1084,10 @@ my @tests = (
         # since we have no reports of failures with it.
        (($is64bit)
         ? ((isASCII)
-           ? "\xff\x80\x90\x90\x90\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
+           ?              "\xff\x80\x90\x90\x90\xbf\xbf\xbf\xbf\xbf\xbf\xbf\xbf"
            : I8_to_native("\xff\xB0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"))
         : ((isASCII)
-           ? "\xfe\x86\x80\x80\x80\x80\x80"
+           ?              "\xfe\x86\x80\x80\x80\x80\x80"
            : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0"))),
 
         # We include both warning categories to make sure the ABOVE_31_BIT one
@@ -878,12 +1105,51 @@ if ($is64bit) {
     push @tests,
         [ "More than 32 bits",
             (isASCII)
-            ? "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80"
+            ?              "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80"
             : I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa2\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
             $UTF8_WARN_ABOVE_31_BIT, $UTF8_DISALLOW_ABOVE_31_BIT,
             'utf8', 0x1000000000, (isASCII) ? 13 : 14,
             qr/Code point 0x.* is not Unicode, and not portable/
         ];
+    if (! isASCII) {
+        push @tests,   # These could falsely show wrongly in a naive implementation
+            [ "requires at least 32 bits",
+                I8_to_native("\xff\xa0\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                'utf8', 0x800000000, 14,
+                qr/Code point 0x800000000 is not Unicode, and not portable/
+            ],
+            [ "requires at least 32 bits",
+                I8_to_native("\xff\xa0\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                'utf8', 0x10000000000, 14,
+                qr/Code point 0x10000000000 is not Unicode, and not portable/
+            ],
+            [ "requires at least 32 bits",
+                I8_to_native("\xff\xa0\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                'utf8', 0x200000000000, 14,
+                qr/Code point 0x200000000000 is not Unicode, and not portable/
+            ],
+            [ "requires at least 32 bits",
+                I8_to_native("\xff\xa0\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                'utf8', 0x4000000000000, 14,
+                qr/Code point 0x4000000000000 is not Unicode, and not portable/
+            ],
+            [ "requires at least 32 bits",
+                I8_to_native("\xff\xa0\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                'utf8', 0x80000000000000, 14,
+                qr/Code point 0x80000000000000 is not Unicode, and not portable/
+            ],
+            [ "requires at least 32 bits",
+                I8_to_native("\xff\xa1\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
+                $UTF8_WARN_ABOVE_31_BIT,$UTF8_DISALLOW_ABOVE_31_BIT,
+                'utf8', 0x1000000000000000, 14,
+                qr/Code point 0x1000000000000000 is not Unicode, and not portable/
+            ];
+    }
 }
 
 foreach my $test (@tests) {
@@ -892,6 +1158,24 @@ foreach my $test (@tests) {
     my $length = length $bytes;
     my $will_overflow = $testname =~ /overflow/;
 
+    {
+        use warnings;
+        undef @warnings;
+        my $ret = test_isUTF8_CHAR($bytes, $length);
+        if ($will_overflow) {
+            is($ret, 0, "isUTF8_CHAR() $testname: returns 0");
+        }
+        else {
+            is($ret, $length,
+               "isUTF8_CHAR() $testname: returns expected length: $length");
+        }
+        unless (is(scalar @warnings, 0,
+                "isUTF8_CHAR() $testname: generated no warnings"))
+        {
+            diag "The warnings were: " . join(", ", @warnings);
+        }
+    }
+
     # This is more complicated than the malformations tested earlier, as there
     # are several orthogonal variables involved.  We test all the subclasses
     # of utf8 warnings to verify they work with and without the utf8 class,
@@ -940,13 +1224,14 @@ foreach my $test (@tests) {
                     }
                     else {
                         unless (is($ret_ref->[0], $allowed_uv,
-                                            "$this_name: Returns expected uv"))
+                                   "$this_name: Returns expected uv: "
+                                 . sprintf("0x%04X", $allowed_uv)))
                         {
                             diag $call;
                         }
                     }
                     unless (is($ret_ref->[1], $expected_len,
-                                    "$this_name: Returns expected length"))
+                        "$this_name: Returns expected length: $expected_len"))
                     {
                         diag $call;
                     }
@@ -1019,7 +1304,7 @@ foreach my $test (@tests) {
                             diag $call;
                         }
                         unless (is($ret_ref->[1], -1,
-                            "$this_name: CHECK_ONLY: returns expected length"))
+                            "$this_name: CHECK_ONLY: returns -1 for length"))
                         {
                             diag $call;
                         }
diff --git a/handy.h b/handy.h
index fc68736..d131cfc 100644
--- a/handy.h
+++ b/handy.h
@@ -493,6 +493,11 @@ Returns zero if non-equal, or non-zero if equal.
        (sizeof(s2)-1 == l && memEQ(s1, ("" s2 ""), (sizeof(s2)-1)))
 #define memNEs(s1, l, s2) !memEQs(s1, l, s2)
 
+#define memLT(s1,s2,l) (memcmp(s1,s2,l) < 0)
+#define memLE(s1,s2,l) (memcmp(s1,s2,l) <= 0)
+#define memGT(s1,s2,l) (memcmp(s1,s2,l) > 0)
+#define memGE(s1,s2,l) (memcmp(s1,s2,l) >= 0)
+
 /*
  * Character classes.
  *
diff --git a/inline.h b/inline.h
index 8071e6e..74343a1 100644
--- a/inline.h
+++ b/inline.h
@@ -269,10 +269,10 @@ S_append_utf8_from_native_byte(const U8 byte, U8** dest)
     PERL_ARGS_ASSERT_APPEND_UTF8_FROM_NATIVE_BYTE;
 
     if (NATIVE_BYTE_IS_INVARIANT(byte))
-        *(*dest)++ = byte;
+        *((*dest)++) = byte;
     else {
-        *(*dest)++ = UTF8_EIGHT_BIT_HI(byte);
-        *(*dest)++ = UTF8_EIGHT_BIT_LO(byte);
+        *((*dest)++) = UTF8_EIGHT_BIT_HI(byte);
+        *((*dest)++) = UTF8_EIGHT_BIT_LO(byte);
     }
 }
 
diff --git a/perl.h b/perl.h
index f1914a8..454304b 100644
--- a/perl.h
+++ b/perl.h
@@ -1041,9 +1041,8 @@ EXTERN_C int usleep(unsigned int);
 #    endif
 #  endif
 #else
-#   ifndef memcmp
-#      define memcmp   my_memcmp
-#   endif
+#   undef memcmp
+#   define memcmp   my_memcmp
 #endif /* HAS_MEMCMP && HAS_SANE_MEMCMP */
 
 #ifndef memzero
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 44c2a98..152c34b 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1534,7 +1534,7 @@ became generally reliable) through v5.18.  The difference is that Perl
 treated all C<\p{}> matches as failing, but all C<\P{}> matches as
 succeeding.
 
-One problem with this is that it leads to unexpected, and confusting
+One problem with this is that it leads to unexpected, and confusing
 results in some cases:
 
  chr(0x110000) =~ \p{ASCII_Hex_Digit=True}      # Failed on <= v5.18
diff --git a/proto.h b/proto.h
index 908deb2..ec99684 100644
--- a/proto.h
+++ b/proto.h
@@ -5550,6 +5550,12 @@ PERL_STATIC_INLINE bool  S_is_utf8_common(pTHX_ const U8 *const p, SV **swash, co
 #define PERL_ARGS_ASSERT_IS_UTF8_COMMON        \
        assert(p); assert(swash); assert(swashname)
 
+PERL_STATIC_INLINE bool        S_is_utf8_cp_above_31_bits(const U8 * const s, const U8 * const e)
+                       __attribute__warn_unused_result__
+                       __attribute__pure__;
+#define PERL_ARGS_ASSERT_IS_UTF8_CP_ABOVE_31_BITS      \
+       assert(s); assert(e)
+
 STATIC U8*     S_swash_scan_list_line(pTHX_ U8* l, U8* const lend, UV* min, UV* max, UV* val, const bool wants_value, const U8* const typestr)
                        __attribute__warn_unused_result__;
 #define PERL_ARGS_ASSERT_SWASH_SCAN_LIST_LINE  \
diff --git a/regcharclass.h b/regcharclass.h
index 761f081..e0a021a 100644
--- a/regcharclass.h
+++ b/regcharclass.h
@@ -183,7 +183,7 @@
        : ( ( ( ( 0xF4 == ((U8*)s)[0] ) && ( 0x8F == ((U8*)s)[1] ) ) && ( 0xBF == ((U8*)s)[2] ) ) && ( ( ((U8*)s)[3] & 0xFE ) == 0xBE ) ) ? 4 : 0 ) : 0 )
 
 /*
-       SURROGATE: Surrogate characters
+       SURROGATE: Surrogate code points
 
        \p{_Perl_Surrogate}
 */
@@ -780,7 +780,7 @@
        : ( ( ( ( ( 0xEE == ((U8*)s)[0] ) && ( 0x42 == ((U8*)s)[1] ) ) && ( 0x73 == ((U8*)s)[2] ) ) && ( 0x73 == ((U8*)s)[3] ) ) && ( ( ((U8*)s)[4] & 0xFE ) == 0x72 ) ) ? 5 : 0 ) : 0 )
 
 /*
-       SURROGATE: Surrogate characters
+       SURROGATE: Surrogate code points
 
        \p{_Perl_Surrogate}
 */
@@ -789,17 +789,6 @@
 ( ( ( ( ( ( ((e) - (s)) >= 4 ) && ( 0xDD == ((U8*)s)[0] ) ) && ( 0x65 == 
((U8*)s)[1] || 0x66 == ((U8*)s)[1] ) ) && ( ( 0x41 <= ((U8*)s)[2] && 
((U8*)s)[2] <= 0x4A ) || ( 0x51 <= ((U8*)s)[2] && ((U8*)s ... [302 chars 
truncated]
 
 /*
-       UTF8_CHAR: Matches legal UTF-EBCDIC encoded characters from 2 through 3 bytes
-
-       0xA0 - 0x3FFF
-*/
-/*** GENERATED CODE ***/
-#define is_UTF8_CHAR_utf8_no_length_checks(s)                               \
-( ( 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0x90 ) || ( 
0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 0xAA <= ((U8*)s)[0] && 
((U8*)s)[0] <= 0xAC ) || ( 0xAE <= ((U8*)s)[0]  ... [29 chars truncated]
-    ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 
0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 ) ? 2 : 0  ... [2 chars truncated]
-: ( ( ( ( ( ((U8*)s)[0] & 0xFC ) == 0xB8 ) || ((U8*)s)[0] == 0xBC || ( ( 
((U8*)s)[0] & 0xFE ) == 0xBE ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xCC ) ) && ( ( 0x41 <= (( ... [372 chars truncated]
-
-/*
        QUOTEMETA: Meta-characters that \Q should quote
 
        \p{_Perl_Quotemeta}
@@ -1392,7 +1381,7 @@
        : ( ( ( ( ( 0xEE == ((U8*)s)[0] ) && ( 0x42 == ((U8*)s)[1] ) ) && ( 0x72 == ((U8*)s)[2] ) ) && ( 0x72 == ((U8*)s)[3] ) ) && ( 0x71 == ((U8*)s)[4] || 0x72 == ((U8*)s)[4] ) ) ? 5 : 0 ) : 0 )
 
 /*
-       SURROGATE: Surrogate characters
+       SURROGATE: Surrogate code points
 
        \p{_Perl_Surrogate}
 */
@@ -1401,17 +1390,6 @@
 ( ( ( ( ( ( ((e) - (s)) >= 4 ) && ( 0xDD == ((U8*)s)[0] ) ) && ( ( ((U8*)s)[1] 
& 0xFE ) == 0x64 ) ) && ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4A ) || ( 
0x51 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x59 ... [368 chars truncated]
 
 /*
-       UTF8_CHAR: Matches legal UTF-EBCDIC encoded characters from 2 through 3 bytes
-
-       0xA0 - 0x3FFF
-*/
-/*** GENERATED CODE ***/
-#define is_UTF8_CHAR_utf8_no_length_checks(s)                               \
-( ( 0x78 == ((U8*)s)[0] || 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && 
((U8*)s)[0] <= 0x90 ) || ( 0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 
0xAA <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xAF ) || ... [52 chars truncated]
-    ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= 
((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <= ((U8*)s) ... [41 chars 
truncated]
-: ( ( ( ((U8*)s)[0] == 0xB7 || ( ( ((U8*)s)[0] & 0xFE ) == 0xB8 ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xBC ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( 
((U8*)s)[0] & 0xFC ) == 0xCC ) ) && ( ( 0x41 <= (( ... [450 chars truncated]
-
-/*
        QUOTEMETA: Meta-characters that \Q should quote
 
        \p{_Perl_Quotemeta}
@@ -1898,6 +1876,6 @@
  * 5c7eb94310e2aaa15702fd6bed24ff0e7ab5448f9a8231d8c49ca96c9e941089 lib/unicore/mktables
  * cdecb300baad839a6f62791229f551a4fa33f3cbdca08e378dc976466354e778 lib/unicore/version
  * 913d2f93f3cb6cdf1664db888bf840bc4eb074eef824e082fceda24a9445e60c regen/charset_translations.pl
- * 863deb27147ca9d19f764755eafddf26d6227b007b1f5b6b87662bcb46cf491c regen/regcharclass.pl
+ * 1876ece914e2c14ed38c8a589adaa3d8193532c3a5bbe9ea5c3279bc9d29b279 regen/regcharclass.pl
  * 393f8d882713a3ba227351ad0f00ea4839fda74fcf77dcd1cdf31519925adba5 regen/regcharclass_multi_char_folds.pl
  * ex: set ro: */
diff --git a/regen/regcharclass.pl b/regen/regcharclass.pl
index e22720b..bd677ac 100755
--- a/regen/regcharclass.pl
+++ b/regen/regcharclass.pl
@@ -1633,39 +1633,33 @@ NONCHAR: Non character code points
 => UTF8 :safe
 \p{_Perl_Nchar}
 
-SURROGATE: Surrogate characters
+SURROGATE: Surrogate code points
 => UTF8 :safe
 \p{_Perl_Surrogate}
 
-# This program was run with this enabled, and the results copied to utf8.h;
-# then this was commented out because it takes so long to figure out these 2
-# million code points.  The results would not change unless utf8.h decides it
-# wants a maximum other than 4 bytes, or this program creates better
+# This program was run with this enabled, and the results copied to utf8.h and
+# utfebcdic.h; then this was commented out because it takes so long to figure
+# out these 2 million code points.  The results would not change unless utf8.h
+# decides it wants a different maximum, or this program creates better
 # optimizations.  Trying with 5 bytes used too much memory to calculate.
 #
 # We don't generate code for invariants here because the EBCDIC form is too
 # complicated and would slow things down; instead the user should test for
 # invariants first.
 #
-# NOTE: The number of bytes generated here must match the value in
-# IS_UTF8_CHAR_FAST in utf8.h
+# 0x1FFFFF was chosen because for both UTF-8 and UTF-EBCDIC, its start byte
+# is the same as 0x10FFFF, and it includes all the above-Unicode code points
+# that have that start byte.  In other words, it is the natural stopping place
+# that includes all Unicode code points.
 #
-#UTF8_CHAR: Matches legal UTF-8 encoded characters from 2 through 4 bytes
+#UTF8_CHAR: Matches legal UTF-8 variant code points up through 0x1FFFFF
 #=> UTF8 :no_length_checks only_ascii_platform
 #0x80 - 0x1FFFFF
 
-# This hasn't been commented out, but the number of bytes it works on has been
-# cut down to 3, so it doesn't cover the full legal Unicode range.  Making it
-# 5 bytes would cover beyond the full range, but takes quite a bit of time and
-# memory to calculate.  The generated table varies depending on the EBCDIC
-# code page.
+#UTF8_CHAR: Matches legal UTF-EBCDIC variant code points up through 0x1FFFFF
+#=> UTF8 :no_length_checks only_ebcdic_platform
+#0xA0 - 0x1FFFFF
 
-# NOTE: The number of bytes generated here must match the value in
-# IS_UTF8_CHAR_FAST in utf8.h
-#
-UTF8_CHAR: Matches legal UTF-EBCDIC encoded characters from 2 through 3 bytes
-=> UTF8 :no_length_checks only_ebcdic_platform
-0xA0 - 0x3FFF
 
 QUOTEMETA: Meta-characters that \Q should quote
 => high :fast
diff --git a/t/lib/warnings/utf8 b/t/lib/warnings/utf8
index 947dea4..4263c04 100644
--- a/t/lib/warnings/utf8
+++ b/t/lib/warnings/utf8
@@ -756,6 +756,7 @@ Operation "uc" returns its argument for non-Unicode code point 0x7F+ at - line \
 Use of code point 0x80+ is deprecated; the permissible max is 0x7F+ at - line \d+.
 Operation "uc" returns its argument for non-Unicode code point 0x80+ at - line \d+.
 Code point 0x7F+ is not Unicode, may not be portable in print at - line \d+.
+Use of code point 0x80+ is deprecated; the permissible max is 0x7F+ in print at - line \d+.
 ########
 # NAME  [perl #127262]
 BEGIN{
diff --git a/utf8.c b/utf8.c
index 0fcb6b6..2b9ea5b 100644
--- a/utf8.c
+++ b/utf8.c
@@ -296,7 +296,14 @@ contain these.
 
 The flag C<UNICODE_WARN_ILLEGAL_INTERCHANGE> selects all three of
 the above WARN flags; and C<UNICODE_DISALLOW_ILLEGAL_INTERCHANGE> selects all
-three DISALLOW flags.
+three DISALLOW flags.  C<UNICODE_DISALLOW_ILLEGAL_INTERCHANGE> restricts the
+allowed inputs to the strict UTF-8 traditionally defined by Unicode.
+Similarly, C<UNICODE_WARN_ILLEGAL_C9_INTERCHANGE> and
+C<UNICODE_DISALLOW_ILLEGAL_C9_INTERCHANGE> are shortcuts to select the
+above-Unicode and surrogate flags, but not the non-character ones, as
+defined in
+L<Unicode Corrigendum #9|http://www.unicode.org/versions/corrigendum9.html>.
+See L<perlunicode/Noncharacter code points>.
 
 Code points above 0x7FFF_FFFF (2**31 - 1) were never specified in any standard,
 so using them is more problematic than other above-Unicode code points.  Perl
@@ -335,6 +342,87 @@ Perl_uvchr_to_utf8_flags(pTHX_ U8 *d, UV uv, UV flags)
     return uvchr_to_utf8_flags(d, uv, flags);
 }
 
+PERL_STATIC_INLINE bool
+S_is_utf8_cp_above_31_bits(const U8 * const s, const U8 * const e)
+{
+    /* Returns TRUE if the first code point represented by the Perl-extended-
+     * UTF-8-encoded string starting at 's', and looking no further than 'e -
+     * 1' doesn't fit into 31 bits.  That is, if it is >= 2**31.
+     *
+     * The function handles the case where the input bytes do not include all
+     * the ones necessary to represent a full character.  That is, they may be
+     * the initial bytes of the representation of a code point, but possibly
+     * the final ones necessary for the complete representation may be beyond
+     * 'e - 1'.
+     *
+     * The function assumes that the sequence is well-formed UTF-8 as far as it
+     * goes, and is for a UTF-8 variant code point.  If the sequence is
+     * incomplete, the function returns FALSE if there is any well-formed
+     * UTF-8 byte sequence that can complete it in such a way that a code point
+     * < 2**31 is produced; otherwise it returns TRUE.
+     *
+     * Getting this exactly right is slightly tricky, and has to be done in
+     * several places in this file, so is centralized here.  It is based on the
+     * following table:
+     *
+     * U+7FFFFFFF (2 ** 31 - 1):
+     *      ASCII: \xFD\xBF\xBF\xBF\xBF\xBF
+     *   IBM-1047: \xFE\x41\x41\x41\x41\x41\x41\x42\x73\x73\x73\x73\x73\x73
+     *    IBM-037: \xFE\x41\x41\x41\x41\x41\x41\x42\x72\x72\x72\x72\x72\x72
+     *   POSIX-BC: \xFE\x41\x41\x41\x41\x41\x41\x42\x75\x75\x75\x75\x75\x75
+     *         I8: \xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA1\xBF\xBF\xBF\xBF\xBF\xBF
+     * U+80000000 (2 ** 31):
+     *      ASCII: \xFE\x82\x80\x80\x80\x80\x80
+     *              [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] 10  11  12  13
+     *   IBM-1047: \xFE\x41\x41\x41\x41\x41\x41\x43\x41\x41\x41\x41\x41\x41
+     *    IBM-037: \xFE\x41\x41\x41\x41\x41\x41\x43\x41\x41\x41\x41\x41\x41
+     *   POSIX-BC: \xFE\x41\x41\x41\x41\x41\x41\x43\x41\x41\x41\x41\x41\x41
+     *         I8: \xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA2\xA0\xA0\xA0\xA0\xA0\xA0
+     */
+
+#ifdef EBCDIC
+
+        /* [0] is start byte    [1] [2] [3] [4] [5] [6] [7] */
+    const U8 prefix[] = "\x41\x41\x41\x41\x41\x41\x42";
+    const STRLEN prefix_len = sizeof(prefix) - 1;
+    const STRLEN len = e - s;
+    const STRLEN cmp_len = MIN(prefix_len, len - 1);
+
+#else
+
+    PERL_UNUSED_ARG(e);
+
+#endif
+
+    PERL_ARGS_ASSERT_IS_UTF8_CP_ABOVE_31_BITS;
+
+    assert(! UTF8_IS_INVARIANT(*s));
+
+#ifndef EBCDIC
+
+    /* Technically, a start byte of FE can be for a code point that fits into
+     * 31 bits, but not for well-formed UTF-8: doing that requires an overlong
+     * malformation. */
+    return (*s >= 0xFE);
+
+#else
+
+    /* On the EBCDIC code pages we handle, only 0xFE can mean a 32-bit or
+     * larger code point (0xFF is an invariant).  For 0xFE, we need at least 2
+     * bytes, and maybe up through 8 bytes, to be sure if the value is above 31
+     * bits. */
+    if (*s != 0xFE || len == 1) {
+        return FALSE;
+    }
+
+    /* Note that in UTF-EBCDIC, the two lowest possible continuation bytes are
+     * \x41 and \x42. */
+    return cBOOL(memGT(s + 1, prefix, cmp_len));
+
+#endif
+
+}
+
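The prefix comparison in S_is_utf8_cp_above_31_bits can be tried outside of the Perl core. The following is a minimal sketch, assuming plain C types in place of U8 and STRLEN; the function names above_31_bits_ebcdic and above_31_bits_ascii are hypothetical. It declares the prefix as an array so that sizeof yields the string size rather than a pointer size, and it compares only as many continuation bytes as are actually available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-alone version of the EBCDIC branch.  's' points at
 * the start byte and 'e' one past the last available byte. */
static int above_31_bits_ebcdic(const unsigned char *s,
                                const unsigned char *e)
{
    /* Bytes [1]..[7] of U+7FFFFFFF, identical on the code pages listed
     * in the table in the patch. */
    static const unsigned char prefix[] = "\x41\x41\x41\x41\x41\x41\x42";
    const size_t prefix_len = sizeof(prefix) - 1;
    const size_t len = (size_t)(e - s);
    size_t cmp_len;

    /* Only start byte 0xFE can exceed 31 bits, and a second byte is
     * needed before anything can be decided. */
    if (len < 2 || s[0] != 0xFE)
        return 0;

    /* Compare the available continuation bytes against the prefix. */
    cmp_len = (len - 1 < prefix_len) ? len - 1 : prefix_len;
    return memcmp(s + 1, prefix, cmp_len) > 0;
}

/* On ASCII platforms the start byte alone decides. */
static int above_31_bits_ascii(const unsigned char *s)
{
    return s[0] >= 0xFE;
}
```

A truncated input that still matches the prefix is reported as not above 31 bits, because some completion of it yields a code point below 2**31.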
 /*
 
 A helper function for the macro isUTF8_CHAR(), which should be used instead of
@@ -372,69 +460,88 @@ Perl__is_utf8_char_slow(const U8 * const s, const STRLEN len)
         }
     }
 
-#ifndef EBCDIC
-
-    /* Here is syntactically valid.  Make sure this isn't the start of an
-     * overlong.  These values were found by manually inspecting the UTF-8
-     * patterns.  See the tables in utf8.h and utfebcdic.h */
-
-    /* This is not needed on modern perls where C0 and C1 are not considered
-     * start bytes. */
-#if 0
-    if (UNLIKELY(*s < 0xC2)) {
-        return 0;
-    }
-#endif
+    /* Here is syntactically valid.  Next, make sure this isn't the start of an
+     * overlong.  Overlongs can occur whenever the number of continuation bytes
+     * changes.  That means whenever the number of leading 1 bits in a start
+     * byte increases from the next lower start byte.  That happens for start
+     * bytes C0, E0, F0, F8, FC, FE, and FF.  On modern perls, the following
+     * illegal start bytes have already been excluded, so don't need to be
+     * tested here;
+     * ASCII platforms: C0, C1
+     * EBCDIC platforms: C0, C1, C2, C3, C4, E0
+     *
+     * At least a second byte is required to determine if other sequences will
+     * be an overlong. */
 
     if (len > 1) {
-        if (   (*s == 0xE0 && UNLIKELY(s[1] < 0xA0))
-            || (*s == 0xF0 && UNLIKELY(s[1] < 0x90))
-            || (*s == 0xF8 && UNLIKELY(s[1] < 0x88))
-            || (*s == 0xFC && UNLIKELY(s[1] < 0x84))
-            || (*s == 0xFE && UNLIKELY(s[1] < 0x82)))
-        {
-            return 0;
+        const U8 s0 = NATIVE_UTF8_TO_I8(s[0]);
+        const U8 s1 = NATIVE_UTF8_TO_I8(s[1]);
+
+        /* Each platform has overlongs after the start bytes given above
+         * (expressed in I8 for EBCDIC).  What constitutes an overlong varies
+         * by platform, but the logic is the same, except the E0 overlong has
+         * already been excluded on EBCDIC platforms.  The values below were
+         * found by manually inspecting the UTF-8 patterns.  See the tables in
+         * utf8.h and utfebcdic.h */
+
+#       ifdef EBCDIC
+#           define F0_ABOVE_OVERLONG 0xB0
+#           define F8_ABOVE_OVERLONG 0xA8
+#           define FC_ABOVE_OVERLONG 0xA4
+#           define FE_ABOVE_OVERLONG 0xA2
+#           define FF_OVERLONG_PREFIX "\xfe\x41\x41\x41\x41\x41\x41\x41"
+                                      /* I8(0xfe) is FF */
+#       else
+
+        if (s0 == 0xE0 && UNLIKELY(s1 < 0xA0)) {
+            return 0;       /* Overlong */
         }
-        if ((len > 6 && UNLIKELY(*s == 0xFF) && UNLIKELY(s[6] < 0x81))) {
-            return 0;
-        }
-    }
 
-#else   /* For EBCDIC, we use I8, which is the same on all code pages */
-    {
-        const U8 s0 = NATIVE_UTF8_TO_I8(*s);
+#           define F0_ABOVE_OVERLONG 0x90
+#           define F8_ABOVE_OVERLONG 0x88
+#           define FC_ABOVE_OVERLONG 0x84
+#           define FE_ABOVE_OVERLONG 0x82
+#           define FF_OVERLONG_PREFIX "\xff\x80\x80\x80\x80\x80\x80"
+#       endif
 
-        /* On modern perls C0-C4 aren't considered start bytes */
-        if ( /* s0 < 0xC5 || */ s0 == 0xE0) {
-            return 0;
+
+        if (   (s0 == 0xF0 && UNLIKELY(s1 < F0_ABOVE_OVERLONG))
+            || (s0 == 0xF8 && UNLIKELY(s1 < F8_ABOVE_OVERLONG))
+            || (s0 == 0xFC && UNLIKELY(s1 < FC_ABOVE_OVERLONG))
+            || (s0 == 0xFE && UNLIKELY(s1 < FE_ABOVE_OVERLONG)))
+        {
+            return 0;       /* Overlong */
         }
 
-        if (len >= 1) {
-            const U8 s1 = NATIVE_UTF8_TO_I8(s[1]);
+#   if defined(UV_IS_QUAD) || defined(EBCDIC)
 
-            if (   (s0 == 0xF0 && UNLIKELY(s1 < 0xB0))
-                || (s0 == 0xF8 && UNLIKELY(s1 < 0xA8))
-                || (s0 == 0xFC && UNLIKELY(s1 < 0xA4))
-                || (s0 == 0xFE && UNLIKELY(s1 < 0x82)))
-            {
-                return 0;
-            }
-            if ((len > 7 && UNLIKELY(s0 == 0xFF) && UNLIKELY(s[7] < 0xA1))) {
-                return 0;
-            }
+        /* Check for the FF overlong.  This happens only if all these bytes
+         * match; what comes after them doesn't matter.  See tables in utf8.h,
+         * utfebcdic.h.  (Can't happen on ASCII 32-bit platforms, as overflows
+         * instead.) */
+
+        if (   len >= sizeof(FF_OVERLONG_PREFIX) - 1
+            && UNLIKELY(memEQ(s, FF_OVERLONG_PREFIX,
+                                               sizeof(FF_OVERLONG_PREFIX) - 1)))
+        {
+            return 0;       /* Overlong */
         }
-    }
 
 #endif
 
-    /* Now see if this would overflow a UV on this platform.  See if the UTF8
-     * for this code point is larger than that for the highest representable
-     * code point */
+    }
+
+    /* Finally, see if this would overflow a UV on this platform.  See if the
+     * UTF8 for this code point is larger than that for the highest
+     * representable code point.  (For ASCII platforms, we could use memcmp()
+     * because we don't have to convert each byte to I8, but it's very rare
+     * input indeed that would approach overflow, so the loop below will likely
+     * only get executed once.) */
     y = (const U8 *) HIGHEST_REPRESENTABLE_UTF8;
 
     for (x = s; x < e; x++, y++) {
 
-        /* If the same at this byte, go on to the next */
+        /* If the same as this byte, go on to the next */
         if (UNLIKELY(NATIVE_UTF8_TO_I8(*x) == *y)) {
             continue;
         }
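The overlong checks added to Perl__is_utf8_char_slow in the hunk above can be illustrated with a small stand-alone function. This is a sketch for ASCII platforms only, using the thresholds and the FF prefix from the patch; the name starts_overlong is hypothetical, the input is assumed to be syntactically valid already, and the FF test is applied unconditionally here rather than being guarded by UV_IS_QUAD as in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the sequence at 's' (with 'len' bytes available) begins
 * an overlong encoding, else 0.  ASCII-platform values only. */
static int starts_overlong(const unsigned char *s, size_t len)
{
    static const unsigned char ff_prefix[] = "\xff\x80\x80\x80\x80\x80\x80";
    const size_t ff_len = sizeof(ff_prefix) - 1;

    /* A second byte is needed before an overlong can be recognized. */
    if (len < 2)
        return 0;

    /* For each start byte where the sequence length grows, a second
     * byte below the threshold means a shorter encoding existed. */
    if (   (s[0] == 0xE0 && s[1] < 0xA0)
        || (s[0] == 0xF0 && s[1] < 0x90)
        || (s[0] == 0xF8 && s[1] < 0x88)
        || (s[0] == 0xFC && s[1] < 0x84)
        || (s[0] == 0xFE && s[1] < 0x82))
    {
        return 1;
    }

    /* The FF overlong needs the whole prefix to match. */
    return len >= ff_len && memcmp(s, ff_prefix, ff_len) == 0;
}
```

Start bytes C0 and C1 are not tested because, as the comment in the patch notes, they have already been rejected as illegal start bytes by this point.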
@@ -488,20 +595,30 @@ C<retlen> to C<-1> (cast to C<STRLEN>) and return zero.
 
 Note that this API requires disambiguation between successful decoding a C<NUL>
 character, and an error return (unless the C<UTF8_CHECK_ONLY> flag is set), as
-in both cases, 0 is returned.  To disambiguate, upon a zero return, see if the
-first byte of C<s> is 0 as well.  If so, the input was a C<NUL>; if not, the
-input had an error.
+in both cases, 0 is returned, and, depending on the malformation, C<retlen> may
+be set to 1.  To disambiguate, upon a zero return, see if the first byte of
+C<s> is 0 as well.  If so, the input was a C<NUL>; if not, the input had an
+error.
 
 Certain code points are considered problematic.  These are Unicode surrogates,
 Unicode non-characters, and code points above the Unicode maximum of 0x10FFFF.
 By default these are considered regular code points, but certain situations
-warrant special handling for them.  If C<flags> contains
-C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>, all three classes are treated as
-malformations and handled as such.  The flags C<UTF8_DISALLOW_SURROGATE>,
-C<UTF8_DISALLOW_NONCHAR>, and C<UTF8_DISALLOW_SUPER> (meaning above the legal
-Unicode maximum) can be set to disallow these categories individually.
-
-The flags C<UTF8_WARN_ILLEGAL_INTERCHANGE>, C<UTF8_WARN_SURROGATE>,
+warrant special handling for them, which can be specified using the C<flags>
+parameter.  If C<flags> contains C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>, all
+three classes are treated as malformations and handled as such.  The flags
+C<UTF8_DISALLOW_SURROGATE>, C<UTF8_DISALLOW_NONCHAR>, and
+C<UTF8_DISALLOW_SUPER> (meaning above the legal Unicode maximum) can be set to
+disallow these categories individually.  C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>
+restricts the allowed inputs to the strict UTF-8 traditionally defined by
+Unicode.  Use C<UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE> to use the strictness
+definition given by
+L<Unicode Corrigendum #9|http://www.unicode.org/versions/corrigendum9.html>.
+The difference between traditional strictness and C9 strictness is that the
+latter does not forbid non-character code points.  (They are still discouraged,
+however.)  For more discussion see L<perlunicode/Noncharacter code points>.
+
+The flags C<UTF8_WARN_ILLEGAL_INTERCHANGE>,
+C<UTF8_WARN_ILLEGAL_C9_INTERCHANGE>, C<UTF8_WARN_SURROGATE>,
 C<UTF8_WARN_NONCHAR>, and C<UTF8_WARN_SUPER> will cause warning messages to be
 raised for their respective categories, but otherwise the code points are
 considered valid (not malformations).  To get a category to both be treated as
@@ -784,35 +901,14 @@ Perl_utf8n_to_uvchr(pTHX_ const U8 *s, STRLEN curlen, STRLEN *retlen, U32 flags)
             /* The maximum code point ever specified by a standard was
              * 2**31 - 1.  Anything larger than that is a Perl extension that
              * very well may not be understood by other applications (including
-             * earlier perl versions on EBCDIC platforms).  On ASCII platforms,
-             * these code points are indicated by the first UTF-8 byte being
-             * 0xFE or 0xFF.  We test for these after the regular SUPER ones,
-             * and before possibly bailing out, so that the slightly more dire
-             * warning will override the regular one. */
-            if (
-#ifndef EBCDIC
-                (*s0 & 0xFE) == 0xFE   /* matches both FE, FF */
-#else
-                 /* The I8 for 2**31 (U+80000000) is
-                  *   \xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA2\xA0\xA0\xA0\xA0\xA0\xA0
-                  * and it turns out that on all EBCDIC pages recognized that
-                  * the UTF-EBCDIC for that code point is
-                  *   \xFE\x41\x41\x41\x41\x41\x41\x43\x41\x41\x41\x41\x41\x41
-                  * For the next lower code point, the 1047 UTF-EBCDIC is
-                  *   \xFE\x41\x41\x41\x41\x41\x41\x42\x73\x73\x73\x73\x73\x73
-                  * The other code pages differ only in the bytes following
-                  * \x42.  Thus the following works (the minimum continuation
-                  * byte is \x41). */
-                *s0 == 0xFE && send - s0 > 7 && (   s0[1] > 0x41
-                                                 || s0[2] > 0x41
-                                                 || s0[3] > 0x41
-                                                 || s0[4] > 0x41
-                                                 || s0[5] > 0x41
-                                                 || s0[6] > 0x41
-                                                 || s0[7] > 0x42)
-#endif
-                && (flags & (UTF8_WARN_ABOVE_31_BIT|UTF8_WARN_SUPER
-                            |UTF8_DISALLOW_ABOVE_31_BIT)))
+             * earlier perl versions on EBCDIC platforms).  We test for these
+             * after the regular SUPER ones, and before possibly bailing out,
+             * so that the slightly more dire warning will override the regular
+             * one. */
+            if (   (flags & (UTF8_WARN_ABOVE_31_BIT
+                            |UTF8_WARN_SUPER
+                            |UTF8_DISALLOW_ABOVE_31_BIT))
+                && UNLIKELY(is_utf8_cp_above_31_bits(s0, send)))
             {
                 if (  ! (flags & UTF8_CHECK_ONLY)
                     &&  (flags & (UTF8_WARN_ABOVE_31_BIT|UTF8_WARN_SUPER))
@@ -3955,25 +4051,19 @@ Perl_check_utf8_print(pTHX_ const U8* s, const STRLEN len)
            if (UTF8_IS_SUPER(s, e)) {
                 if (   ckWARN_d(WARN_NON_UNICODE)
                     || (   ckWARN_d(WARN_DEPRECATED)
-#if defined(UV_IS_QUAD)
+#ifndef UV_IS_QUAD
+                        && UNLIKELY(is_utf8_cp_above_31_bits(s, e))
+#else   /* Below is 64-bit words */
                         /* 2**63 and up meet these conditions provided we have
                          * a 64-bit word. */
 #   ifdef EBCDIC
-                        && *s == 0xFE && e - s >= UTF8_MAXBYTES
-                        && s[1] >= 0x49
+                        && *s == 0xFE
+                        && NATIVE_UTF8_TO_I8(s[1]) >= 0xA8
 #   else
-                        && *s == 0xFF && e -s >= UTF8_MAXBYTES
+                        && *s == 0xFF
+                           /* s[1] being above 0x80 overflows */
                         && s[2] >= 0x88
 #   endif
-#else   /* Below is 32-bit words */
-                        /* 2**31 and above meet these conditions on all EBCDIC
-                         * pages recognized for 32-bit platforms */
-#   ifdef EBCDIC
-                        && *s == 0xFE && e - s >= UTF8_MAXBYTES
-                        && s[6] >= 0x43
-#   else
-                        && *s >= 0xFE
-#   endif
 #endif
                 )) {
                     /* A side effect of this function will be to warn */
diff --git a/utf8.h b/utf8.h
index 30b7193..7202dc4 100644
--- a/utf8.h
+++ b/utf8.h
@@ -238,9 +238,17 @@ Perl's extended UTF-8 means we can have start bytes up to FF.
  * being encoded in UTF-8 or not? */
 #define OFFUNI_IS_INVARIANT(cp)     isASCII(cp)
 
-/* Is the representation of the code point 'cp' the same regardless of
- * being encoded in UTF-8 or not?  'cp' is native if < 256; Unicode otherwise
- * */
+/*
+=for apidoc Am|bool|UVCHR_IS_INVARIANT|UV cp
+
+Evaluates to 1 if the representation of code point C<cp> is the same whether or
+not it is encoded in UTF-8; otherwise evaluates to 0.  UTF-8 invariant
+characters can be copied as-is when converting to/from UTF-8, saving time.
+C<cp> is Unicode if above 255; otherwise is platform-native.
+
+=cut
+ */
+
 #define UVCHR_IS_INVARIANT(cp)      OFFUNI_IS_INVARIANT(cp)
 
 /* This defines the bits that are to be in the continuation bytes of a multi-byte
@@ -295,6 +303,33 @@ Perl's extended UTF-8 means we can have start bytes up to FF.
  * encounter */
 #define isUTF8_POSSIBLY_PROBLEMATIC(c) ((U8) c >= 0xED)
 
+/* A helper macro for isUTF8_CHAR, so use that one instead of this.  This was
+ * generated by regen/regcharclass.pl, and then moved here.  Then it was
+ * hand-edited to add some LIKELY() calls, presuming that malformations are
+ * unlikely.  The lines that generated it were then commented out.  This was
+ * done because it takes on the order of 10 minutes to generate, and is never
+ * going to change, unless the generated code is improved, and figuring out
+ * the LIKELYs there would be hard.
+ *
+        UTF8_CHAR: Matches legal UTF-8 variant code points up through 0x1FFFFF
+
+       0x80 - 0x1FFFFF
+*/
+/*** GENERATED CODE ***/
+#define is_UTF8_CHAR_utf8_no_length_checks(s)                               \
+( ( 0xC2 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xDF ) ?                          \
+    ( LIKELY( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) ? 2 : 0 )                    \
+: ( 0xE0 == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ((U8*)s)[1] & 0xE0 ) == 0xA0 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( 0xE1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xEF ) ?                          \
+    ( LIKELY( ( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
+: ( 0xF0 == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( 0x90 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0xBF ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )\
+: ( ( ( ( 0xF1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xF7 ) && LIKELY( ( 
((U8*)s)[1] & 0xC0 ) == 0x80 ) ) && LIKELY( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) 
&& LIKELY( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 :  ... [3 chars truncated]
+
+/* The above macro handles UTF-8 that has this start byte as the maximum */
+#define _IS_UTF8_CHAR_HIGHEST_START_BYTE 0xF7
+
 #endif /* EBCDIC vs ASCII */
 
 /* 2**UTF_ACCUMULATION_SHIFT - 1 */
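The generated macro is_UTF8_CHAR_utf8_no_length_checks above is dense, so here is a readable sketch of the same start-byte cases for ASCII platforms. The function name utf8_char_len and the IS_CONT helper are hypothetical. Unlike the macro, this version takes the number of available bytes and checks it, which is an addition made for the sake of a safe stand-alone example.

```c
#include <assert.h>
#include <stddef.h>

/* True for a UTF-8 continuation byte, 0x80..0xBF. */
#define IS_CONT(b) (((b) & 0xC0) == 0x80)

/* Returns the length (2..4) of the well-formed, non-overlong UTF-8
 * variant character at 's' for code points 0x80..0x1FFFFF, or 0 if the
 * bytes do not form one.  'avail' is the number of readable bytes. */
static size_t utf8_char_len(const unsigned char *s, size_t avail)
{
    if (avail >= 2 && s[0] >= 0xC2 && s[0] <= 0xDF)
        return IS_CONT(s[1]) ? 2 : 0;

    /* E0 requires a second byte of A0..BF, else it is overlong. */
    if (avail >= 3 && s[0] == 0xE0)
        return ((s[1] & 0xE0) == 0xA0 && IS_CONT(s[2])) ? 3 : 0;

    if (avail >= 3 && s[0] >= 0xE1 && s[0] <= 0xEF)
        return (IS_CONT(s[1]) && IS_CONT(s[2])) ? 3 : 0;

    /* F0 requires a second byte of 90..BF, else it is overlong. */
    if (avail >= 4 && s[0] == 0xF0)
        return (s[1] >= 0x90 && s[1] <= 0xBF
                && IS_CONT(s[2]) && IS_CONT(s[3])) ? 4 : 0;

    /* F7 is the highest start byte covered, giving 0x1FFFFF. */
    if (avail >= 4 && s[0] >= 0xF1 && s[0] <= 0xF7)
        return (IS_CONT(s[1]) && IS_CONT(s[2]) && IS_CONT(s[3])) ? 4 : 0;

    return 0;
}
```

As with the macro, surrogates and above-Unicode code points up to 0x1FFFFF are accepted here; excluding them is the job of the DISALLOW flags, not of this check.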
@@ -487,13 +522,27 @@ only) byte is pointed to by C<s>.
  * through 255 */
 #define UNI_IS_INVARIANT(cp)   UVCHR_IS_INVARIANT(cp)
 
-/* Is the byte 'c' the same character when encoded in UTF-8 as when not.  This
- * works on both UTF-8 encoded strings and non-encoded, as it returns TRUE in
- * each for the exact same set of bit patterns.  It is valid on a subset of
- * what UVCHR_IS_INVARIANT is valid on, so can just use that; and the compiler
- * should optimize out anything extraneous given the implementation of the
- * latter.  The |0 makes sure this isn't mistakenly called with a ptr argument.
- * */
+/*
+=for apidoc Am|bool|UTF8_IS_INVARIANT|char c
+
+Evaluates to 1 if the byte C<c> represents the same character when encoded in
+UTF-8 as when not; otherwise evaluates to 0.  UTF-8 invariant characters can be
+copied as-is when converting to/from UTF-8, saving time.
+
+In spite of the name, this macro gives the correct result if the input string
+from which C<c> comes is not encoded in UTF-8.
+
+See C<L</UVCHR_IS_INVARIANT>> for checking if a UV is invariant.
+
+=cut
+
+The reason it works on both UTF-8 encoded and non-UTF-8 encoded strings is
+that it returns TRUE in each for the exact same set of bit patterns.  It is
+valid on a subset of what UVCHR_IS_INVARIANT is valid on, so can just use that;
+and the compiler should optimize out anything extraneous given the
+implementation of the latter.  The |0 makes sure this isn't mistakenly called
+with a ptr argument.
+*/
 #define UTF8_IS_INVARIANT(c)   UVCHR_IS_INVARIANT((c) | 0)
 
 /* Like the above, but its name implies a non-UTF8 input, which as the comments
@@ -638,11 +687,16 @@ case any call to string overloading updates the internal UTF-8 encoding flag.
 #define UTF8_ALLOW_FFFF 0
 #define UTF8_ALLOW_SURROGATE 0
 
+/* C9 refers to Unicode Corrigendum #9: allows but discourages non-chars */
+#define UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE                                    \
+                                 (UTF8_DISALLOW_SUPER|UTF8_DISALLOW_SURROGATE)
+#define UTF8_WARN_ILLEGAL_C9_INTERCHANGE (UTF8_WARN_SUPER|UTF8_WARN_SURROGATE)
+
 #define UTF8_DISALLOW_ILLEGAL_INTERCHANGE                                       \
-                       ( UTF8_DISALLOW_SUPER|UTF8_DISALLOW_NONCHAR              \
-                        |UTF8_DISALLOW_SURROGATE)
+                  (UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE|UTF8_DISALLOW_NONCHAR)
 #define UTF8_WARN_ILLEGAL_INTERCHANGE \
-                         (UTF8_WARN_SUPER|UTF8_WARN_NONCHAR|UTF8_WARN_SURROGATE)
+                          (UTF8_WARN_ILLEGAL_C9_INTERCHANGE|UTF8_WARN_NONCHAR)
+
 #define UTF8_ALLOW_ANY                                                          \
            (~( UTF8_DISALLOW_ILLEGAL_INTERCHANGE|UTF8_DISALLOW_ABOVE_31_BIT    \
                |UTF8_WARN_ILLEGAL_INTERCHANGE|UTF8_WARN_ABOVE_31_BIT))
@@ -747,10 +801,14 @@ point's representation.
 #define UNICODE_DISALLOW_NONCHAR      0x0020
 #define UNICODE_DISALLOW_SUPER        0x0040
 #define UNICODE_DISALLOW_ABOVE_31_BIT 0x0080
+#define UNICODE_WARN_ILLEGAL_C9_INTERCHANGE                                   \
+                                  (UNICODE_WARN_SURROGATE|UNICODE_WARN_SUPER)
 #define UNICODE_WARN_ILLEGAL_INTERCHANGE                                      \
-            (UNICODE_WARN_SURROGATE|UNICODE_WARN_NONCHAR|UNICODE_WARN_SUPER)
+                   (UNICODE_WARN_ILLEGAL_C9_INTERCHANGE|UNICODE_WARN_NONCHAR)
+#define UNICODE_DISALLOW_ILLEGAL_C9_INTERCHANGE                               \
+                          (UNICODE_DISALLOW_SURROGATE|UNICODE_DISALLOW_SUPER)
 #define UNICODE_DISALLOW_ILLEGAL_INTERCHANGE                                  \
- (UNICODE_DISALLOW_SURROGATE|UNICODE_DISALLOW_NONCHAR|UNICODE_DISALLOW_SUPER)
+           (UNICODE_DISALLOW_ILLEGAL_C9_INTERCHANGE|UNICODE_DISALLOW_NONCHAR)
 
 /* For backward source compatibility, as are now the default */
 #define UNICODE_ALLOW_SURROGATE 0
@@ -826,48 +884,6 @@ point's representation.
 /* If you want to exclude surrogates, and beyond legal Unicode, see the blame
  * log for earlier versions which gave details for these */
 
-/* A helper macro for isUTF8_CHAR, so use that one, and not this one.  This is
- * retained solely for backwards compatibility and may be deprecated and
- * removed in a future Perl version.
- *
- * regen/regcharclass.pl generates is_UTF8_CHAR_utf8() macros for up to these
- * number of bytes.  So this has to be coordinated with that file */
-#ifdef EBCDIC
-#   define IS_UTF8_CHAR_FAST(n) ((n) <= 3)
-#else
-#   define IS_UTF8_CHAR_FAST(n) ((n) <= 4)
-#endif
-
-#ifndef EBCDIC
-/* A helper macro for isUTF8_CHAR, so use that one instead of this.  This was
- * generated by regen/regcharclass.pl, and then moved here.  Then it was
- * hand-edited to add some LIKELY() calls, presuming that malformations are
- * unlikely.  The lines that generated it were then commented out.  This was
- * done because it takes on the order of 10 minutes to generate, and is never
- * going to change, unless the generated code is improved, and figuring out
- * there the LIKELYs would be hard.
- *
- * The EBCDIC versions have been cut to not cover all of legal Unicode,
- * otherwise they take too long to generate; besides there is a separate one
- * for each code page, so they are in regcharclass.h instead of here */
-/*
-       UTF8_CHAR: Matches legal UTF-8 encoded characters from 2 through 4 bytes
-
-       0x80 - 0x1FFFFF
-*/
-/*** GENERATED CODE ***/
-#define is_UTF8_CHAR_utf8_no_length_checks(s)                               \
-( ( 0xC2 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xDF ) ?                          \
-    ( LIKELY( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) ? 2 : 0 )                    \
-: ( 0xE0 == ((U8*)s)[0] ) ?                                                 \
-    ( LIKELY( ( ( ((U8*)s)[1] & 0xE0 ) == 0xA0 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
-: ( 0xE1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xEF ) ?                          \
-    ( LIKELY( ( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) ? 3 : 0 )\
-: ( 0xF0 == ((U8*)s)[0] ) ?                                                 \
-    ( LIKELY( ( ( 0x90 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0xBF ) && ( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) && ( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 : 0 )\
-: ( ( ( ( 0xF1 <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xF7 ) && LIKELY( ( ((U8*)s)[1] & 0xC0 ) == 0x80 ) ) && LIKELY( ( ((U8*)s)[2] & 0xC0 ) == 0x80 ) ) && LIKELY( ( ((U8*)s)[3] & 0xC0 ) == 0x80 ) ) ? 4 :  ... [3 chars truncated]
-#endif
-
 /*
 
 =for apidoc Am|STRLEN|isUTF8_CHAR|const U8 *s|const U8 *e
@@ -894,15 +910,16 @@ is a valid UTF-8 character.
 =cut
 */
 
-#define isUTF8_CHAR(s, e)   (UNLIKELY((e) <= (s))                           \
-                             ? 0                                            \
-                             : (UTF8_IS_INVARIANT(*s))                      \
-                               ? 1                                          \
-                               : UNLIKELY(((e) - (s)) < UTF8SKIP(s))        \
-                                 ? 0                                        \
-                                 : LIKELY(IS_UTF8_CHAR_FAST(UTF8SKIP(s)))   \
-                                   ? is_UTF8_CHAR_utf8_no_length_checks(s)  \
-                                   : _is_utf8_char_slow(s, UTF8SKIP(s)))
+#define isUTF8_CHAR(s, e)                                                   \
+    (UNLIKELY((e) <= (s))                                                   \
+    ? 0                                                                     \
+    : (UTF8_IS_INVARIANT(*s))                                               \
+    ? 1                                                                     \
+    : UNLIKELY(((e) - (s)) < UTF8SKIP(s))                                   \
+      ? 0                                                                   \
+      : LIKELY(NATIVE_UTF8_TO_I8(*s) <= _IS_UTF8_CHAR_HIGHEST_START_BYTE)   \
+      ? is_UTF8_CHAR_utf8_no_length_checks(s)                               \
+      : _is_utf8_char_slow(s, UTF8SKIP(s)))
 
 #define is_utf8_char_buf(buf, buf_end) isUTF8_CHAR(buf, buf_end)
 
diff --git a/utfebcdic.h b/utfebcdic.h
index 227e0eb..7d37fbc 100644
--- a/utfebcdic.h
+++ b/utfebcdic.h
@@ -203,7 +203,7 @@ above what a 64 bit word can hold
   U+40000..U+FFFFF     F8      * A8..BF  A0..BF  A0..BF  A0..BF
  U+100000..U+10FFFF    F9        A0..A1  A0..BF  A0..BF  A0..BF
     Below are above-Unicode code points
- U+110000..U+1FFFFF    F9      * A2..BF  A0..BF  A0..BF  A0..BF
+ U+110000..U+1FFFFF    F9        A2..BF  A0..BF  A0..BF  A0..BF
  U+200000..U+3FFFFF    FA..FB    A0..BF  A0..BF  A0..BF  A0..BF
  U+400000..U+1FFFFFF   FC      * A4..BF  A0..BF  A0..BF  A0..BF  A0..BF
 U+2000000..U+3FFFFFF   FD        A0..BF  A0..BF  A0..BF  A0..BF  A0..BF
@@ -275,6 +275,57 @@ explicitly forbidden, and the shortest possible encoding should always be used
 #   define HIGHEST_REPRESENTABLE_UTF8  "\xFF\xA0\xA0\xA0\xA0\xA0\xA0\xA3\xBF\xBF\xBF\xBF\xBF\xBF"
 #endif
 
+/* A helper macro for isUTF8_CHAR, so use that one instead of this.  This was
+ * generated by regen/regcharclass.pl, and then moved here.  Then it was
+ * hand-edited to add some LIKELY() calls, presuming that malformations are
+ * unlikely.  The lines that generated it were then commented out.  This was
+ * done because it takes on the order of 10 minutes to generate, and is never
+ * going to change, unless the generated code is improved, and figuring out
+ * the LIKELYs there would be hard.
+ *
+        UTF8_CHAR: Matches legal UTF-EBCDIC variant code points up through 0x1FFFFFF
+
+       0xA0 - 0x1FFFFF
+*/
+#if '^' == 95 /* CP 1047 */
+
+/*** GENERATED CODE ***/
+#define is_UTF8_CHAR_utf8_no_length_checks(s)                               \
+( ( 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0x90 ) || ( 0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 0xAA <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xAC ) || ( 0xAE <= ((U8*)s)[0]  ... [29 chars truncated]
+    ( LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 ) ?  ... [8 chars truncated]
+: ( ( ( ((U8*)s)[0] & 0xFC ) == 0xB8 ) || ((U8*)s)[0] == 0xBC || ( ( ((U8*)s)[0] & 0xFE ) == 0xBE ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( ((U8*)s)[0] & 0xFC ) == 0xCC ) ) ?\
+    ( LIKELY( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 )  ... [200 chars truncated]
+: ( 0xDC == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x57 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70 ) && ( ( 0x41 <= ((U8*)s)[2] && ((U8*)s)[2] <= 0x4 ... [342 chars truncated]
+: ( ( 0xDD <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xDF ) || 0xE1 == ((U8*)s)[0] || ( 0xEA <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xEC ) ) ?\
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x70  ... [392 chars truncated]
+: ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( ( 0x49 == ((U8*)s)[1] || 0x4A == ((U8*)s)[1] ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( ((U8*)s)[1] & 0xFC ) == 0x7 ... [584 chars truncated]
+: ( ( ( ( ( 0xEE == ((U8*)s)[0] ) && LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( (( ... [628 chars truncated]
+
+#endif
+
+#if '^' == 176 /* CP 037 */
+
+/*** GENERATED CODE ***/
+#define is_UTF8_CHAR_utf8_no_length_checks(s)                               \
+( ( 0x78 == ((U8*)s)[0] || 0x80 == ((U8*)s)[0] || ( 0x8A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0x90 ) || ( 0x9A <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xA0 ) || ( 0xAA <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xAF ) || ... [52 chars truncated]
+    ( LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <= (( ... [47 chars truncated]
+: ( ((U8*)s)[0] == 0xB7 || ( ( ((U8*)s)[0] & 0xFE ) == 0xB8 ) || ( ( ((U8*)s)[0] & 0xFC ) == 0xBC ) || ( ( ((U8*)s)[0] & 0xEE ) == 0xCA ) || ( ( ((U8*)s)[0] & 0xFC ) == 0xCC ) ) ?\
+    ( LIKELY( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <=  ... [278 chars truncated]
+: ( 0xDC == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( 0x57 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x72 ) ) && ( ( 0x ... [459 chars truncated]
+: ( ( 0xDD <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xDF ) || 0xE1 == ((U8*)s)[0] || ( 0xEA <= ((U8*)s)[0] && ((U8*)s)[0] <= 0xEC ) ) ?\
+    ( LIKELY( ( ( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 < ... [509 chars truncated]
+: ( 0xED == ((U8*)s)[0] ) ?                                                 \
+    ( LIKELY( ( ( ( ( 0x49 == ((U8*)s)[1] || 0x4A == ((U8*)s)[1] ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x6A ) || ( 0x70 ... [740 chars truncated]
+: ( ( ( ( ( 0xEE == ((U8*)s)[0] ) && LIKELY( ( 0x41 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x4A ) || ( 0x51 <= ((U8*)s)[1] && ((U8*)s)[1] <= 0x59 ) || 0x5F == ((U8*)s)[1] || ( 0x62 <= ((U8*)s)[1] && ((U8*) ... [784 chars truncated]
+
+#endif
+
+/* The above macro in both code pages handles UTF-8 that has this start byte
+ * (expressed in I8) as the maximum */
+#define _IS_UTF8_CHAR_HIGHEST_START_BYTE 0xF9
 
 /*
  * ex: set ts=8 sts=4 sw=4 et:
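
A note on the flag changes in the diff above: the new C9 macros are built by OR-ing the surrogate and above-Unicode bits, and the older, stricter sets are now defined as the C9 set plus the non-character bit. The following standalone C sketch illustrates that composition. It is not perl code: every DEMO_ name is hypothetical, the NONCHAR and SUPER bit values mirror the UNICODE_DISALLOW_* context lines shown in the diff, and the SURROGATE value is an assumption because it does not appear in this excerpt.

```c
#include <assert.h>

/* Hypothetical stand-ins for the UNICODE_DISALLOW_* bits. */
#define DEMO_DISALLOW_SURROGATE 0x0010  /* assumed value, not shown in the diff */
#define DEMO_DISALLOW_NONCHAR   0x0020  /* as in the diff context */
#define DEMO_DISALLOW_SUPER     0x0040  /* as in the diff context */

/* Corrigendum #9 set: surrogates and above-Unicode only. */
#define DEMO_DISALLOW_ILLEGAL_C9_INTERCHANGE \
    (DEMO_DISALLOW_SURROGATE | DEMO_DISALLOW_SUPER)

/* Stricter set: the C9 set plus non-characters. */
#define DEMO_DISALLOW_ILLEGAL_INTERCHANGE \
    (DEMO_DISALLOW_ILLEGAL_C9_INTERCHANGE | DEMO_DISALLOW_NONCHAR)

/* Return the single category bit a code point belongs to, or 0. */
static unsigned demo_category(unsigned long cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return DEMO_DISALLOW_SURROGATE;
    if (cp > 0x10FFFF)
        return DEMO_DISALLOW_SUPER;
    /* Non-characters: U+FDD0..U+FDEF, and U+xxFFFE / U+xxFFFF in any plane. */
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return DEMO_DISALLOW_NONCHAR;
    return 0;
}

/* Nonzero if the code point falls in a category selected by 'flags'. */
static int demo_is_rejected(unsigned long cp, unsigned flags)
{
    return (demo_category(cp) & flags) != 0;
}
```

With these definitions, a non-character such as U+FFFE is rejected under DEMO_DISALLOW_ILLEGAL_INTERCHANGE but passes under DEMO_DISALLOW_ILLEGAL_C9_INTERCHANGE, while a surrogate such as U+D800 is rejected under both.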

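The order of checks in the isUTF8_CHAR macro from the diff (empty buffer, invariant byte, enough bytes available, then fast path or slow path) can be sketched in plain C for ASCII platforms as below. This is a simplified illustration, not the perl implementation: all demo_ names are hypothetical, the byte ranges are modeled on the removed ASCII is_UTF8_CHAR_utf8_no_length_checks macro, and start bytes above 0xF7 are simply rejected here where perl would fall back to _is_utf8_char_slow.

```c
#include <assert.h>
#include <stddef.h>

/* True for a UTF-8 continuation byte (10xxxxxx). */
static int demo_is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Expected sequence length from the start byte; 0 if not handled here. */
static size_t demo_skip(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xC2) return 0;   /* continuation byte, or overlong C0/C1 */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 0;                 /* longer forms: slow path in perl, rejected here */
}

/* Length of the character starting at s (buffer ends at e), or 0. */
static size_t demo_is_utf8_char(const unsigned char *s, const unsigned char *e)
{
    size_t len;

    if (e <= s) return 0;                     /* empty buffer */
    if (s[0] < 0x80) return 1;                /* invariant on ASCII platforms */

    len = demo_skip(s[0]);
    if (len == 0 || (size_t)(e - s) < len)    /* bad start byte or truncated */
        return 0;

    if (len == 2)
        return demo_is_cont(s[1]) ? 2 : 0;

    if (len == 3) {
        if (s[0] == 0xE0 && s[1] < 0xA0) return 0;   /* overlong */
        return (demo_is_cont(s[1]) && demo_is_cont(s[2])) ? 3 : 0;
    }

    /* len == 4 */
    if (s[0] == 0xF0 && s[1] < 0x90) return 0;       /* overlong */
    return (demo_is_cont(s[1]) && demo_is_cont(s[2]) && demo_is_cont(s[3]))
           ? 4 : 0;
}
```

Like the macro it is modeled on, this sketch accepts surrogates and code points above U+10FFFF up to 0x1FFFFF; excluding those is a separate policy decision.
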
--
Perl5 Master Repository
