In perl.git, the branch smoke-me/khw-encode has been created

<http://perl5.git.perl.org/perl.git/commitdiff/adf9b819500501defb89b82745d8d368303bec57?hp=0000000000000000000000000000000000000000>

        at  adf9b819500501defb89b82745d8d368303bec57 (commit)

- Log -----------------------------------------------------------------
commit adf9b819500501defb89b82745d8d368303bec57
Author: Karl Williamson <k...@cpan.org>
Date:   Fri Sep 16 22:21:17 2016 -0600

    smoke

M utf8.c

commit d05927bc3d33f576c529353f46502c626f820d80
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 15 09:09:07 2016 -0600

    XXX incomplete: Add sv_utf8_decode_flags

M embed.fnc
M embed.h
M proto.h
M sv.c
M sv.h

commit f0467b8c149eb86abef90c614cbfbaa184d68d3e
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 15 09:06:39 2016 -0600

    perlapi: Minor clarifications to sv_utf8_decode

M sv.c

commit f066d559c1d5c2d6c7cb659bd3ba9a626c8d519f
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 22:40:23 2016 -0600

    customized

M t/porting/customized.dat

commit f62b4cdfb78820b4a46c9835ad655a8cf4792c14
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:20:52 2016 -0600

    Use core REPLACEMENT CHARACTER definition

    This allows the code to now work on EBCDIC as well.

M cpan/Encode/Encode/encode.h

commit bd211626afa26b7657c5e23e03877512d7f54004
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:16:00 2016 -0600

    XXX commit msg: Encode.xs: Rmv unused function

M cpan/Encode/Encode.xs

commit 69ce7f673fa40415c7407ef08e17c38598da49ca
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:12:39 2016 -0600

    Encode.xs: white-space only

M cpan/Encode/Encode.xs

commit 738ff5eb5c2e8aff55dbe274aee9fa783187c040
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 1 12:12:06 2016 -0600

    XXX maybe more in commit msg: Speed up Encode UTF-8 validation checking

    This replaces the current scheme for checking UTF-8 validity by one in
    which normal processing doesn't require having to decode the UTF-8 into
    code points.  The copying of characters individually from the input to
    the output is changed to be a single operation for each entire span of
    valid input at once.
    Thus in the normal case, what ends up happening is a tight loop to
    check the validity, and then a memmove of the entire input to the
    output, then return.  If an error is found, it copies all the valid
    input before the error, then handles the character in error, then
    positions to the next input position, and repeats the whole process
    starting from there.

    It uses the functionality available from the Perl 5 core to look at
    just the bytes that comprise the UTF-8 to make the determination,
    converting to code points only those that are defective somehow, in
    order to display them in warnings and error messages.  Thus, this does
    not need to know about the intricacies of UTF-8 malformations, relying
    on the core to handle this.

    This cannot be pushed to CPAN until Devel::PPPort has been updated to
    implement all the functions now needed.

M cpan/Encode/Encode.pm
M cpan/Encode/Encode.xs

commit 05b298e243da060f60f30d6391ec5c67e4b0eef3
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 20:15:56 2016 -0600

    XXX tests: Add is_utf8_buf_flags() and use it

    This encodes a simple pattern that may not be immediately obvious to
    someone needing it.  If you have a fixed-size buffer that is full of
    purportedly UTF-8 bytes, is it valid or not?  It's easy to do, as shown
    in this commit.  The file test operators -T and -B can be simplified by
    using this function.

M embed.fnc
M embed.h
M inline.h
M pp_sys.c
M proto.h

commit 502a38034364d3fce09dbeaec8bab135b92170da
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 20:03:16 2016 -0600

    XXX Flesh out, tests: Add is_utf8_foo()

M embed.fnc
M embed.h
M inline.h
M proto.h

commit b543ac64c28bb4d5c1b4da2b54a466d94186b7bf
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 19:57:46 2016 -0600

    Move #define to different header

    Instead of having a comment in one header pointing to the #define in
    the other, remove the indirection and just have the #define itself
    where it is needed.

M inline.h
M utf8.h

commit f56a8a8778df5243e73b421332b67ada3657b773
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 19:49:52 2016 -0600

    perlapi: Clarify docs for some is_utf8_foo functions

M inline.h

commit 48624323edfb1387b785116a36a0517803c1c9c5
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 18:54:23 2016 -0600

    Add isUTF8_CHAR_flags() macro

    This is like the previous 2 commits, but the macro takes a flags
    parameter so any combination of the disallowed flags may be used.  The
    others, along with the original isUTF8_CHAR(), are the most commonly
    desired strictures, and use an implementation of a, hopefully, inlined
    trie for speed.  This is for generality and the major portion of its
    implementation isn't inlined.

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M utf8.h

commit d3044bce6779aaa538c761d54600aa393b5c6a3c
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 16:52:41 2016 -0600

    Add macro for Unicode Corrigendum #9 strict

    This macro follows Unicode Corrigendum #9 to allow non-character code
    points.  These are still discouraged but not completely forbidden.
    It's best for code that isn't intended to operate on arbitrary other
    code text to use the original definition, but code that does things,
    such as source code control, should change to use this definition if it
    wants to be Unicode-strict.

    Perl can't adopt C9 wholesale, as it might create security holes in
    existing applications that rely on Perl keeping non-chars out.

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M regcharclass.h
M regen/regcharclass.pl
M utf8.h
M utfebcdic.h

commit 435f2b411812edbbcf8e3b9d6f22c1f3f8aafdb1
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 13:38:22 2016 -0600

    Add macro for determining if UTF-8 is Unicode-strict

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t
M regcharclass.h
M regen/regcharclass.pl
M utf8.h
M utfebcdic.h

commit 8abfbe338708a13de82a75896c9b6405f24dc7d3
Author: Karl Williamson <k...@cpan.org>
Date:   Mon Sep 12 14:30:15 2016 -0600

    perlapi: Clarify isUTF8_CHAR()

M utf8.h

commit e026936ba1a284adec6fa4b00989aa3c395df5ce
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 17:09:51 2016 -0600

    inline.h: Add 'const's; avoid hiding outer variable

    This changes some formal parameters to be const, and avoids reusing the
    same variable name within an inner block, to avoid confusion

M embed.fnc
M inline.h
M mathoms.c
M proto.h

commit c6a3a1a663c525b88ba7002cba8bc5a325916ba4
Author: Karl Williamson <k...@cpan.org>
Date:   Thu Sep 8 11:34:15 2016 -0600

    Add tests for is_valid_partial_utf8_char_flags()

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t

commit 3e7888e357befaf1c44194f1f002d959668aed3a
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 11 22:18:57 2016 -0600

    Add is_utf8_valid_partial_char_flags()

    This is a generalization of is_utf8_valid_partial_char to allow the
    caller to automatically exclude things such as surrogates.

M embed.fnc
M embed.h
M inline.h
M proto.h

commit d939808d01f53e56eaa4f92a1f4a9683c2a13baa
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 11 09:40:37 2016 -0600

    perlapi: Reword description of is_utf8_valid_partial_char

M inline.h

commit 0a1007ccbfeae6e1e1779c9d321f8cf64808ca49
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:27:37 2016 -0600

    Fix off-by-one error in is_utf8_valid_partial_char()

M inline.h

commit 90defc4a2c69ccf6aa3850e1917cc3e8b1fc979f
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:24:48 2016 -0600

    handy.h: Comment memEQs and memNEs

M handy.h

commit c1cf1a0a1a7e32092e7e86ebeabab4f274027fd7
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:18:59 2016 -0600

    utf8.c: Add some UNLIKELYs

M utf8.c

commit 1ecdbc8ddd9859a4f81e57815fb6c8d0d7df4a27
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:18:16 2016 -0600

    utf8.h: Add comment, white-space changes

M utf8.h

commit 882c2c3c2992acb6a67221d2055d317d43eee106
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:09:44 2016 -0600

    Enhance and rename is_utf8_char_slow()

    This changes the name of this helper function and adds a parameter and
    functionality to allow it to exclude problematic classes of code
    points, the same ones excludable by utf8n_to_uvchr(), like surrogates
    or non-character code points.

M embed.fnc
M embed.h
M inline.h
M proto.h
M utf8.c
M utf8.h

commit df57aa9ba86deeda358f0345ab9145564c4ec06d
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:22:01 2016 -0600

    APItest/t/utf8.t: Add tests

    These fill in gaps in current testing.  In particular all the possible
    overlong UTF-8 edge cases are now tested.

M ext/XS-APItest/t/utf8.t

commit b6f7cebb4d97b8666b24766d43e369f7fe77fea4
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:14:38 2016 -0600

    APItest/utf8.t: Some clean up

    This adds some information to test names, does some white-space
    alignments, changes one test to stress things slightly more, and adds a
    'use bytes' because in some cases the desired byte-oriented output was
    not showing up.

M ext/XS-APItest/t/utf8.t

commit 0ff82715eeaee878beafe899a0dca8c6f670cec0
Author: Karl Williamson <k...@cpan.org>
Date:   Sun Sep 4 21:32:08 2016 -0600

    Test isUTF8_CHAR()

M ext/XS-APItest/APItest.xs
M ext/XS-APItest/t/utf8.t

commit 59c60a40e62af5eabbbc6fe073120d5d2daac783
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 22:19:42 2016 -0600

    lib/warnings/utf8: Reinstate warning test

    I removed this in 35f8c9bd0ff4f298f8bc09ae9848a14a9667a95a, thinking
    the warning was no longer being raised.  But in fact, it was showing a
    bug, now fixed by the previous commit.

M t/lib/warnings/utf8

commit 0094884088c3d72085333f53c123e60e5ab04bd4
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:15:04 2016 -0600

    Revamp overlong handling in is_utf8_char_slow, fixing a bug

    This combines EBCDIC and ASCII branches as much as possible, and fixes
    a bug that showed up only on EBCDIC platforms, and 64-bit ASCII ones
    for the highest overlong, where it could erroneously conclude that a
    sequence was an overlong.

M utf8.c

commit a4f913a9ff912ba9d59d1ea42a91fb0e407efe0b
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:43:42 2016 -0600

    Forbid UTF-8 start bytes 0x FF on 32-bit ASCII

    These all are for code points that won't fit into a 32 bit word.

M utf8.h

commit 62d802bd3241f7cd03f406344de7516a1cbc2ba8
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 21:06:39 2016 -0600

    utf8.c: Fix typo in comment, add some comments

M utf8.c

commit f04b0b8d2d2525eed3f5cbfd17ad04f7d6433c10
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 09:00:03 2016 -0600

    utf8.c: Extract duplicate code to common fcn

    Actually the code isn't quite duplicate, but should be because one
    instance is wrong.  This failure would only show up on 64-bit EBCDIC
    platforms.

M embed.fnc
M embed.h
M ext/XS-APItest/t/utf8.t
M proto.h
M utf8.c

commit 4adf9e30152c6b04a2be2384ff2f08eba17d7ab3
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 08:54:36 2016 -0600

    handy.h: Add memLT, memLE, memGT, memGE

    These correspond to strLT, etc.  I am deferring documenting them in
    case this turns out to be a bad idea for some reason.

M handy.h

commit 2ceab79252696bcdcd1a85aa33f7894890124f8f
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 10 08:46:18 2016 -0600

    XXX unconditionally do memcmp if not sane

M perl.h

commit 45c86a51c68c42f7b5dccb4685a9e1edf5e4868f
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 14:12:27 2016 -0600

    isUTF8_CHAR(): Bring UTF-EBCDIC to parity with ASCII

    This changes the macro isUTF8_CHAR to have the same number of code
    points built-in for EBCDIC as ASCII.  This obsoletes the
    IS_UTF8_CHAR_FAST macro, which is removed.

    Previously, the code generated by regen/regcharclass.pl for ASCII
    platforms was hand copied into utf8.h, and LIKELY's manually added,
    then the generating code was commented out.  Now this has been done
    with EBCDIC platforms as well.  This makes regenerating regcharclass.h
    faster.

    The copied macro in utf8.h is moved by this commit to within the main
    code section for non-EBCDIC compiles, cutting the number of #ifdef's
    down, and the comments about it changed somewhat.

M regcharclass.h
M regen/regcharclass.pl
M utf8.h
M utfebcdic.h

commit 8d9b3365a77be3b6ad6cfbfe520b458da2e08f7e
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 12:15:29 2016 -0600

    regen/regcharclass.pl: surrogates are code points

    They are not "characters"

M regcharclass.h
M regen/regcharclass.pl

commit 0899da10c082013172c212fd291d9e558c849339
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Sep 3 16:13:15 2016 -0600

    Add IS_UTF8_INVARIANT and IS_UVCHR_INVARIANT to API

M utf8.h

commit a2432ca2e2ba85e8c8c00ef4febf80c842fb5d44
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 7 22:03:21 2016 -0600

    utfebcdic.h: Fix typo in comment

M utfebcdic.h

commit 26c211867588c59e51aae4b9132dba1a35dcb364
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 16:05:35 2016 -0600

    Add #defines for XS code for Unicode Corrigendum 9

    These are convenience macros.

M utf8.c
M utf8.h

commit 056961ce93cc98dc2f60658fc864f7393ab98942
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 16:02:50 2016 -0600

    perlapi: Clarify utf8n_to_uvchr entry

M utf8.c

commit e3fbbd1878d66b0d7d180ed8526964c7124e32d9
Author: Karl Williamson <k...@cpan.org>
Date:   Wed Sep 14 15:57:34 2016 -0600

    perlunicode: Fix typo

M pod/perlunicode.pod

commit 5f4c87effa7a251db8fbc5d04dbb05b59cd98291
Author: Karl Williamson <k...@cpan.org>
Date:   Tue Sep 13 16:40:44 2016 -0600

    append_utf8_from_native_byte: Add parens for clarity

    I can never remember the precedence of dereference and ++.

M inline.h

-----------------------------------------------------------------------

--
Perl5 Master Repository
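
Illustrative note on commit 738ff5eb ("Speed up Encode UTF-8 validation
checking"): the span-at-a-time scheme that message describes can be
sketched roughly as below.  This is only a sketch under stated assumptions,
not the code in Encode.xs.  Both function names are hypothetical, the
validity check is a hand-written stand-in for the core routines the commit
relies on (ASCII platforms only), and replacing each offending byte with
U+FFFD is a simplification of "handles the character in error".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the core's validity check (NOT a Perl API):
 * returns the byte length of the well-formed UTF-8 character at s, or 0
 * if the bytes there are malformed, overlong, a surrogate, or encode a
 * code point above U+10FFFF.  No code point is ever computed. */
static size_t sketch_utf8_char_len(const unsigned char *s,
                                   const unsigned char *end)
{
    size_t avail = (size_t)(end - s);
    unsigned char c, lo = 0x80, hi = 0xBF;

    if (avail == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;                      /* ASCII */
    if (c >= 0xC2 && c <= 0xDF)                  /* 2-byte sequence */
        return (avail >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    if (c >= 0xE0 && c <= 0xEF) {                /* 3-byte sequence */
        if (avail < 3) return 0;
        if (c == 0xE0) lo = 0xA0;                /* reject overlongs */
        if (c == 0xED) hi = 0x9F;                /* reject surrogates */
        if (s[1] < lo || s[1] > hi) return 0;
        return ((s[2] & 0xC0) == 0x80) ? 3 : 0;
    }
    if (c >= 0xF0 && c <= 0xF4) {                /* 4-byte sequence */
        if (avail < 4) return 0;
        if (c == 0xF0) lo = 0x90;                /* reject overlongs */
        if (c == 0xF4) hi = 0x8F;                /* reject > U+10FFFF */
        if (s[1] < lo || s[1] > hi) return 0;
        if ((s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80) return 0;
        return 4;
    }
    return 0;                                    /* 0x80..0xC1, 0xF5..0xFF */
}

/* Hypothetical driver: copy src to dst one valid span at a time.  dst
 * must have room for 3 * len bytes.  Returns the number of bytes written. */
static size_t sketch_copy_valid_spans(const unsigned char *src, size_t len,
                                      unsigned char *dst)
{
    const unsigned char *s = src;
    const unsigned char *end = src + len;
    unsigned char *d = dst;

    while (s < end) {
        const unsigned char *span = s;
        size_t n;

        /* Tight loop: advance over valid characters, bytes only. */
        while (s < end && (n = sketch_utf8_char_len(s, end)) != 0)
            s += n;

        /* One copy for the whole valid span. */
        memmove(d, span, (size_t)(s - span));
        d += s - span;

        /* On error: emit U+FFFD, skip one byte, and start over. */
        if (s < end) {
            memcpy(d, "\xEF\xBF\xBD", 3);
            d += 3;
            s++;
        }
    }
    return (size_t)(d - dst);
}
```

For fully valid input the outer loop runs once: one validation pass, then
one memmove, which is the "normal case" the commit message describes.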