In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/9e92194e47c35a3910a8726b88f0abc6c1c0df3c?hp=7cb18e1b020cd2e5d1de687ae046ab2d48a69301>
- Log ----------------------------------------------------------------- commit 9e92194e47c35a3910a8726b88f0abc6c1c0df3c Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 14:04:15 2010 -0600 perlunicode.pod: Add detail on utf8/locale conflicts M pod/perlunicode.pod commit 4b844e06c99d9c2e251dde1c8abc47508b801786 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 13:37:11 2010 -0600 regcomp.h: Remove unused #define ANYOF_RUNTIME() is no longer used, so can be removed. I had long tried to figure out what the purpose of this was, and discovered it really had none. I think it must have had something to do with locales at one time. But locales don't do well with utf8, and I don't know how to make it better. In any event this wasn't actually accomplishing anything. M regcomp.h commit e051a21d2458b5ce83206336c2ee2d39013b8d1b Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 13:31:13 2010 -0600 reginclass: Remove redundant test The previous re-ordering of this function makes it clear that this test doesn't do anything. It is testing the charclass bitmap, but that was already done in the re-ordered block from a previous commit, so if it didn't succeed there, it won't succeed here. In fact, trying to understand why this code was here was what led me to figure out that it wasn't, and that things could be sped up by doing the reordering. M regexec.c commit 7cdde5444c9ad8cccf237ec340ddb54f58ce3cf0 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 13:17:34 2010 -0600 reginclass: Reorder fastest first This patch simply moves the block of code that does the bitmap tests in front of the block of code that deals with potential things not in the bit map. The reason to do this is that it is faster to find things in the bitmap, than to have to create a utf8 swash. The patch also adds some comments, and the first block doesn't have to test if there has been a match, and the second block does, so if statements for those two blocks are adjusted accordingly. The proof that this doesn't break anything stems from the fact that the routine never stops early. If there wasn't a match in the first block of code, it would execute the second block. Thus swapping the order doesn't affect the outcome. The side effects of the first block are reading in the swash. These side effects won't happen if it no longer gets executed, because the other block matched. And thus an error could be introduced if there were coding errors elsewhere that didn't initialize the swash before using it. But that doesn't appear to be the case, as all tests pass. M regexec.c commit ea6756a64e0a36eb91bcea39467a2253c31192b8 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 12:45:50 2010 -0600 reginclass: Remove unnecessary test The previous changes have made it clear that this test never was useful, so remove it. M regexec.c commit a5a291f55baf01a6b4b1013e2d3c722a0ad77432 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 12:36:49 2010 -0600 reginclass: Make explicit the length assumptions reginclass assumes that can match always at least one character. Make that explicit, and now that we have that length always saved, don't recalculate it. M regexec.c commit f7ab54c630d4ff497dfc2435654b5aee46420b17 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 12:31:11 2010 -0600 reginclass: Rename variable for clarity Several other variables in the routine have the previous name M regexec.c commit bf3f2d85838ef8cae6a146bc0d731316fbdb3554 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 12:07:43 2010 -0600 regcomp.h: Clean up some comments M regcomp.h commit eb91e30040d44cf387bdcd0ad4dd10fec2d95799 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 12:05:23 2010 -0600 ANYOF_LARGE is now the same as ANYOF_CLASS These two #defines now mean the same thing. Free up bit used by ANYOF_LARGE M regcomp.h commit 20ed0b260f81efafa0c8e1cd9413ad026e3f19be Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 11:50:08 2010 -0600 regexec.c: reorder statements for speed The call to reginclass is guaranteed by constness to not change locinput, so if going to fail don't waste time calling it. M regexec.c commit af364d036721fad8cdfe034fea03192968f09774 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 11:20:43 2010 -0600 regexec.c: Add clarifying comment M regexec.c commit f6ad78d8b847361e046c98c7811db99387508807 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 10:37:55 2010 -0600 reginclass: add some consts to prototype M embed.fnc M proto.h M regexec.c commit b32d7d3e8f81b492f99ac325dc00ad4dca683023 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 10:33:50 2010 -0600 regexec.c: Remove redundant line. Now that reginclass is guaranteed to return the match length upon success, the caller need not do it again. M regexec.c commit 4b3cda86f0c2a54a1ac8bbbaf4a2412f98dff6c0 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 10:21:14 2010 -0600 reginclass: Return matched length even if not utf8 This also allows for less special case testing M regexec.c commit 6698fab5a29c48acfd94f01ff8a587819f15be79 Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 09:20:41 2010 -0600 reginclass: Change variable name for clarity. M regexec.c commit eba1359ea44dc786981104981f7f0c719920e6ba Author: Karl Williamson <pub...@khwilliamson.com> Date: Sun Oct 31 09:11:39 2010 -0600 regexec.c: Document existing reginclass behavior M regexec.c commit 8ff01ef0717d3f418692ebdcfbeff3221e20f9f8 Author: Father Chrysostomos <spr...@cpan.org> Date: Sun Oct 31 14:35:59 2010 -0700 perldelta up to 7cb18e1b02 M pod/perldelta.pod ----------------------------------------------------------------------- Summary of changes: embed.fnc | 4 +- pod/perldelta.pod | 63 ++++++++++++++++++-- pod/perlunicode.pod | 9 ++- proto.h | 2 +- regcomp.h | 29 ++++------ regexec.c | 161 +++++++++++++++++++++++++++++---------------------- 6 files changed, 170 insertions(+), 98 deletions(-) diff --git a/embed.fnc b/embed.fnc index f900005..edee2f4 100644 --- a/embed.fnc +++ b/embed.fnc @@ -1790,8 +1790,8 @@ Es |U8 |regtail_study |NN struct RExC_state_t *pRExC_state \ ERs |I32 |regmatch |NN regmatch_info *reginfo|NN regnode *prog ERs |I32 |regrepeat |NN const regexp *prog|NN const regnode *p|I32 max|int depth ERs |I32 |regtry |NN regmatch_info *reginfo|NN char **startpos -ERs |bool |reginclass |NULLOK const regexp *prog|NN const regnode *n|NN const U8 *p|NULLOK STRLEN *lenp\ - |bool do_utf8sv_is_utf8 +ERs |bool |reginclass |NULLOK const regexp * const prog|NN const regnode * const n|NN const U8 * const p|NULLOK STRLEN *lenp\ + |bool const do_utf8sv_is_utf8 Es |CHECKPOINT|regcppush |I32 parenfloor Es |char* |regcppop |NN const regexp *rex ERsn |U8* |reghop3 |NN U8 *s|I32 off|NN const U8 *lim diff --git a/pod/perldelta.pod b/pod/perldelta.pod index 203260c..01dffec 100644 --- a/pod/perldelta.pod +++ b/pod/perldelta.pod @@ -1,7 +1,7 @@ =encoding utf8 =for comment -This has been completed up to 0b6a3b5adbe83, except for: +This has been completed up to 7cb18e1b02, except for: 04777d295957ad270188e4debf51b523e07cc5b0 c565ab54dc649bb62cd4d57149d7b2abb21df5f3 1c8d11ca3d0ce8bc11562f159b94c2c7e62dea6c @@ -202,8 +202,6 @@ C<B> has been upgraded from 1.24 to 1.25. =item * -XXX What should the version be? - C<B::Deparse> has been upgraded from 0.99 to 1.01. It fixes deparsing of C<our> followed by a variable with funny characters @@ -265,6 +263,10 @@ C<Math::BigInt::FastCalc> has been upgraded from 0.22 to 0.23. =item * +C<mro> has been upgraded from 1.04 to 1.05. + +=item * + C<NDBM_File> has been upgraded from 1.09 to 1.10. This fixes a memory leak when DBM filters are used. @@ -286,6 +288,12 @@ C<use re "/flags"> pragma. =item * +C<Safe> has been upgraded from 2.28 to 2.29. + +It adds C<&version::vxs::VCMP> to the default share. + +=item * + C<SDBM_File> has been upgraded from 1.07 to 1.08. =item * @@ -533,10 +541,12 @@ C<cophh_> in L<perlapi>. =item * A stash can now have a list of effective names in addition to its usual -name. These can be added and deleted via C<hv_ename_add> and -C<hv_ename_delete>. The first effective name can be accessed via the -C<HvENAME*> macros defined in F<hv.h>. These new functions and macros are -I<not> part of the API. +name. The first effective name can be accessed via the C<HvENAME> macro, +which is now the recommended name to use in MRO linearisations (C<HvNAME> +being a fallback if there is no C<HvENAME>). + +These names are added and deleted via C<hv_ename_add> and +C<hv_ename_delete>. These two functions are I<not> part of the API. =item * @@ -646,6 +656,45 @@ now fixed for regular expressions compiled under the C<"u"> modifier. See L</C<use feature "unicode_strings"> now applies to more regex matching>. L<[perl #18281]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=18281>. +=item * + +Concatenating long strings under C<use encoding> no longer causes perl to +crash +L<[perl #78674]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=78674>. + +=item * + +Typeglob assignments would crash if the glob's stash no longer existed, if +the glob assigned to was named 'ISA' or the glob on either side of the +assignment contained a subroutine. + +=item * + +Calling C<< ->import >> on a class lacking an import method could corrupt the stack result in strange behaviour. For instance, + + push @a, "foo", $b = bar->import; + +would assign 'foo' to C<$b> +L<[perl #63790]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=63790>. + +=item * + +Creating an alias to a package when that package had been detached from the +symbol table would result in corrupted isa caches +L<[perl #77358]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=77358>. + +=item * + +C<.=> followed by C<< <> >> or C<readline> would leak memory if C<$/> +contained characters beyond the octet range and the scalar assigned to +happened to be encoded as UTF8 internally +L<[perl #72246]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=72246>. + +=item * + +The C<recv> function could crash when called with the MSG_TRUNC flag +L<[perl #75082]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=75082>. + =back =head1 Known Problems diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod index dfd6d42..b9a43c3 100644 --- a/pod/perlunicode.pod +++ b/pod/perlunicode.pod @@ -1371,7 +1371,7 @@ for more discussion of the issues. =head2 Locales Usually locale settings and Unicode do not affect each other, but -there are a couple of exceptions: +there are exceptions: =over 4 @@ -1386,7 +1386,12 @@ variable, see L<perlrun> for the documentation of the C<-C> switch. Perl tries really hard to work both with Unicode and the old byte-oriented world. Most often this is nice, but sometimes Perl's -straddling of the proverbial fence causes problems. +straddling of the proverbial fence causes problems. Here's an example +of how things can go wrong. A locale can define a code point to be +anything it wants. It could make 'A' into a control character, for example. +But strings encoded in utf8 always have Unicode semantics, so an 'A' in +such a string is always an uppercase letter, never a control, no matter +what the locale says it should be. =back diff --git a/proto.h b/proto.h index 0027180..8cc3281 100644 --- a/proto.h +++ b/proto.h @@ -6425,7 +6425,7 @@ STATIC U8* S_reghopmaybe3(U8 *s, I32 off, const U8 *lim) #define PERL_ARGS_ASSERT_REGHOPMAYBE3 \ assert(s); assert(lim) -STATIC bool S_reginclass(pTHX_ const regexp *prog, const regnode *n, const U8 *p, STRLEN *lenp, bool do_utf8sv_is_utf8) +STATIC bool S_reginclass(pTHX_ const regexp * const prog, const regnode * const n, const U8 * const p, STRLEN *lenp, bool const do_utf8sv_is_utf8) __attribute__warn_unused_result__ __attribute__nonnull__(pTHX_2) __attribute__nonnull__(pTHX_3); diff --git a/regcomp.h b/regcomp.h index 9e1a4e3..8f0b828 100644 --- a/regcomp.h +++ b/regcomp.h @@ -204,15 +204,16 @@ struct regnode_charclass { U8 flags; U8 type; U16 next_off; - U32 arg1; + U32 arg1; /* used as ptr in S_regclass */ char bitmap[ANYOF_BITMAP_SIZE]; /* only compile-time */ }; -struct regnode_charclass_class { /* has [[:blah:]] classes */ - U8 flags; /* should have ANYOF_CLASS here */ +/* has runtime (locale) \d, \w, ..., [:posix:] classes */ +struct regnode_charclass_class { + U8 flags; /* ANYOF_CLASS bit must go here */ U8 type; U16 next_off; - U32 arg1; + U32 arg1; /* used as ptr in S_regclass */ char bitmap[ANYOF_BITMAP_SIZE]; /* both compile-time */ char classflags[ANYOF_CLASSBITMAP_SIZE]; /* and run-time */ }; @@ -271,7 +272,7 @@ struct regnode_charclass_class { /* has [[:blah:]] classes */ #undef STRING #define OP(p) ((p)->type) -#define FLAGS(p) ((p)->flags) /* Caution: Doesn't apply to all +#define FLAGS(p) ((p)->flags) /* Caution: Doesn't apply to all \ regnode types */ #define OPERAND(p) (((struct regnode_string *)p)->string) #define MASK(p) ((char*)OPERAND(p)) @@ -313,23 +314,17 @@ struct regnode_charclass_class { /* has [[:blah:]] classes */ /* Flags for node->flags of ANYOF */ -#define ANYOF_CLASS 0x08 /* has [[:blah:]] classes */ +#define ANYOF_CLASS 0x08 /* has runtime \d, \w, [:posix:], ... */ +#define ANYOF_LARGE ANYOF_CLASS /* Same; name retained for back compat */ #define ANYOF_INVERT 0x04 #define ANYOF_FOLD 0x02 #define ANYOF_LOCALE 0x01 -/* Used for regstclass only */ -#define ANYOF_EOS 0x10 /* Can match an empty string too */ - -/* There is a character or a range past 0xff */ -#define ANYOF_UNICODE 0x20 -#define ANYOF_UNICODE_ALL 0x40 /* Can match any char past 0xff */ - -/* size of node is large (includes class pointer) */ -#define ANYOF_LARGE 0x80 +/* EOS used for regstclass only */ +#define ANYOF_EOS 0x10 /* Can match an empty string too */ -/* Are there any runtime flags on in this node? */ -#define ANYOF_RUNTIME(s) (ANYOF_FLAGS(s) & 0x0f) +#define ANYOF_UNICODE 0x20 /* Matches >= one thing past 0xff */ +#define ANYOF_UNICODE_ALL 0x40 /* Matches 0x100 - infinity */ #define ANYOF_FLAGS_ALL 0xff diff --git a/regexec.c b/regexec.c index 842afaf..2c9b852 100644 --- a/regexec.c +++ b/regexec.c @@ -94,7 +94,11 @@ #define STATIC static #endif -#define REGINCLASS(prog,p,c) (ANYOF_FLAGS(p) ? reginclass(prog,p,c,0,0) : ANYOF_BITMAP_TEST(p,*(c))) +/* Valid for non-utf8 strings only: avoids the reginclass call if there are no + * complications: i.e., if everything matchable is straight forward in the + * bitmap */ +#define REGINCLASS(prog,p,c) (ANYOF_FLAGS(p) ? reginclass(prog,p,c,0,0) \ + : ANYOF_BITMAP_TEST(p,*(c))) /* * Forwards. @@ -3628,22 +3632,22 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog) case ANYOF: if (utf8_target) { STRLEN inclasslen = PL_regeol - locinput; + if (locinput >= PL_regeol) + sayNO; if (!reginclass(rex, scan, (U8*)locinput, &inclasslen, utf8_target)) goto anyof_fail; - if (locinput >= PL_regeol) - sayNO; - locinput += inclasslen ? inclasslen : UTF8SKIP(locinput); + locinput += inclasslen; nextchr = UCHARAT(locinput); break; } else { if (nextchr < 0) nextchr = UCHARAT(locinput); - if (!REGINCLASS(rex, scan, (U8*)locinput)) - goto anyof_fail; if (!nextchr && locinput >= PL_regeol) sayNO; + if (!REGINCLASS(rex, scan, (U8*)locinput)) + goto anyof_fail; nextchr = UCHARAT(++locinput); break; } @@ -6180,91 +6184,60 @@ Perl_regclass_swash(pTHX_ const regexp *prog, register const regnode* node, bool /* - reginclass - determine if a character falls into a character class - The n is the ANYOF regnode, the p is the target string, lenp - is pointer to the maximum length of how far to go in the p - (if the lenp is zero, UTF8SKIP(p) is used), - utf8_target tells whether the target string is in UTF-8. + n is the ANYOF regnode + p is the target string + lenp is pointer to the maximum number of bytes of how far to go in p + (This is assumed wthout checking to always be at least the current + character's size) + utf8_target tells whether p is in UTF-8. + + Returns true if matched; false otherwise. If lenp is not NULL, on return + from a successful match, the value it points to will be updated to how many + bytes in p were matched. If there was no match, the value is undefined, + possibly changed from the input. */ STATIC bool -S_reginclass(pTHX_ const regexp *prog, register const regnode *n, register const U8* p, STRLEN* lenp, register bool utf8_target) +S_reginclass(pTHX_ const regexp * const prog, register const regnode * const n, register const U8* const p, STRLEN* lenp, register const bool utf8_target) { dVAR; const char flags = ANYOF_FLAGS(n); bool match = FALSE; UV c = *p; - STRLEN len = 0; - STRLEN plen; + STRLEN c_len = 0; + STRLEN maxlen; PERL_ARGS_ASSERT_REGINCLASS; + /* If c is not already the code point, get it */ if (utf8_target && !UTF8_IS_INVARIANT(c)) { - c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len, + c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &c_len, (UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV) | UTF8_ALLOW_FFFF | UTF8_CHECK_ONLY); /* see [perl #37836] for UTF8_ALLOW_ANYUV; [perl #38293] for * UTF8_ALLOW_FFFF */ - if (len == (STRLEN)-1) + if (c_len == (STRLEN)-1) Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)"); } + else { + c_len = 1; + } - plen = lenp ? *lenp : UNISKIP(NATIVE_TO_UNI(c)); - if (utf8_target || (flags & ANYOF_UNICODE)) { - if (lenp) - *lenp = 0; - if (utf8_target && !ANYOF_RUNTIME(n)) { - if (len != (STRLEN)-1 && c < 256 && ANYOF_BITMAP_TEST(n, c)) - match = TRUE; - } - if (!match && utf8_target && (flags & ANYOF_UNICODE_ALL) && c >= 256) - match = TRUE; - if (!match) { - AV *av; - SV * const sw = regclass_swash(prog, n, TRUE, 0, (SV**)&av); - - if (sw) { - U8 * utf8_p; - if (utf8_target) { - utf8_p = (U8 *) p; - } else { - STRLEN len = 1; - utf8_p = bytes_to_utf8(p, &len); - } - if (swash_fetch(sw, utf8_p, 1)) - match = TRUE; - else if (flags & ANYOF_FOLD) { - if (!match && lenp && av) { - I32 i; - for (i = 0; i <= av_len(av); i++) { - SV* const sv = *av_fetch(av, i, FALSE); - STRLEN len; - const char * const s = SvPV_const(sv, len); - if (len <= plen && memEQ(s, (char*)utf8_p, len)) { - *lenp = len; - match = TRUE; - break; - } - } - } - if (!match) { - U8 tmpbuf[UTF8_MAXBYTES_CASE+1]; - - STRLEN tmplen; - to_utf8_fold(utf8_p, tmpbuf, &tmplen); - if (swash_fetch(sw, tmpbuf, 1)) - match = TRUE; - } - } + /* Use passed in max length, or one character if none passed in or less + * than one character. And assume will match just one character. This is + * overwritten later if matched more. */ + if (lenp) { + maxlen = (*lenp > c_len) ? *lenp : c_len; + *lenp = c_len; - /* If we allocated a string above, free it */ - if (! utf8_target) Safefree(utf8_p); - } - } - if (match && lenp && *lenp == 0) - *lenp = UNISKIP(NATIVE_TO_UNI(c)); } - if (!match && c < 256) { + else { + maxlen = c_len; + } + + /* If this character is potentially in the bitmap, check it */ + if (c < 256) { if (ANYOF_BITMAP_TEST(n, c)) match = TRUE; else if (flags & ANYOF_FOLD) { @@ -6320,6 +6293,56 @@ S_reginclass(pTHX_ const regexp *prog, register const regnode *n, register const } } + /* If the bitmap didn't (or couldn't) match, and something outside the + * bitmap could match, try that */ + if (! match && utf8_target || (flags & ANYOF_UNICODE)) { + if (utf8_target && (flags & ANYOF_UNICODE_ALL) && c >= 256) { + match = TRUE; + } + else { + AV *av; + SV * const sw = regclass_swash(prog, n, TRUE, 0, (SV**)&av); + + if (sw) { + U8 * utf8_p; + if (utf8_target) { + utf8_p = (U8 *) p; + } else { + STRLEN len = 1; + utf8_p = bytes_to_utf8(p, &len); + } + if (swash_fetch(sw, utf8_p, 1)) + match = TRUE; + else if (flags & ANYOF_FOLD) { + if (!match && lenp && av) { + I32 i; + for (i = 0; i <= av_len(av); i++) { + SV* const sv = *av_fetch(av, i, FALSE); + STRLEN len; + const char * const s = SvPV_const(sv, len); + if (len <= maxlen && memEQ(s, (char*)utf8_p, len)) { + *lenp = len; + match = TRUE; + break; + } + } + } + if (!match) { + U8 tmpbuf[UTF8_MAXBYTES_CASE+1]; + + STRLEN tmplen; + to_utf8_fold(utf8_p, tmpbuf, &tmplen); + if (swash_fetch(sw, tmpbuf, 1)) + match = TRUE; + } + } + + /* If we allocated a string above, free it */ + if (! utf8_target) Safefree(utf8_p); + } + } + } + return (flags & ANYOF_INVERT) ? !match : match; } -- Perl5 Master Repository