[PHP-DEV] Re: [PHP-CVS] svn: /php/php-src/trunk/ext/standard/ basic_functions.c basic_functions.h html.c html.h html_tables/ents_basic.txt html_tables/ents_basic_apos.txt html_tables/ents_html401.txt html_tables/ents_html5.txt html_tables/ents_xhtml.txt html_tables/html_table_gen.php html_tables/mappings/8859-1.TXT html_tables/mappings/8859-15.TXT html_tables/mappings/8859-5.TXT html_tables/mappings/CP1251.TXT html_tables/mappings/CP1252.TXT html_tables/mappings/CP866.TXT html_tables/mappings/KOI8-R.TXT html_tables/map

Geoffrey Sneddon Tue, 26 Oct 2010 09:35:45 -0700

On 24/10/10 16:01, Gustavo André dos Santos Lopes wrote:

cataphract                               Sun, 24 Oct 2010 15:01:02 +0000


Revision: http://svn.php.net/viewvc?view=revision&revision=304705

Log:

> […]

- For html_entity_decode(), only valid numerical and named entities (as defined
   above for htmlentities()/htmlspecialchars() + !double_encode) are decoded.
   But there is in this case one additional check. Entities that represent
   non-SGML or otherwise invalid characters are not decoded. Note that, in
   HTML5, U+000D is a valid literal character, but the entity &#x0D; is not
   valid and is therefore not decoded.

This shouldn't be the behaviour. &#x0D; is invalid for document conformance, but what is conforming doesn't matter here. What matters here is the parser conformance requirements, which for tokenizing character references are given at <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references>. This makes it quite clear that &#x0D; decodes to U+000D.
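
As a rough illustration of that parsing behaviour, here is a small sketch in Python rather than PHP. It assumes Python's html.unescape(), which is documented as following the HTML5 rules for both valid and invalid character references, is an acceptable stand-in; it says nothing about how PHP implements html_entity_decode(), and the sample string is made up.

```python
import html

# A numeric reference to U+000D is a parse error under the HTML5 rules,
# but the result is still a carriage return, not the untouched source text.
decoded = html.unescape("a&#x0D;b")

# Show the raw characters so the carriage return is visible.
print(repr(decoded))
```

The point is only that a decoder following the parsing rules produces U+000D here instead of leaving the reference in place.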

By not decoding &#x0D; it's not a conforming HTML5 implementation, as there are two allowed behaviours when reaching a parse error such as this: stop processing entirely (effectively dropping the rest of the string) or follow the spec's behaviour for that parse error.

Also, do you decode "&amp" without a trailing semicolon? That's equally required to be decoded by HTML5.
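
The same kind of check can be sketched for the missing semicolon, again using Python's html.unescape() as an assumed stand-in for an HTML5-style decoder rather than anything from PHP. The sample strings are hypothetical.

```python
import html

# "&amp" with no trailing semicolon is still recognised as a character
# reference in ordinary text and decodes to "&".
no_semicolon = html.unescape("fish &amp chips")
with_semicolon = html.unescape("fish &amp; chips")

# Both spellings should give the same decoded text.
print(no_semicolon == with_semicolon)
```

This covers only plain text content; the HTML5 rules treat some unterminated references inside attribute values differently, which this sketch does not model.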

I presume the implementation decodes the "&amp;" in '<xmp>&amp;</xmp>', which equally makes it non-conforming from an HTML5 POV…

Bug: http://bugs.php.net/52860 (Open) htmlspecialchars/htmlentities stripping 
invalid characters

Ah, this shows the mistake: you follow section 8.1 (writing HTML documents) for character references instead of 8.2 (parsing HTML documents), whereas when you're decoding entities you're parsing it…


HTH,

Geoffrey.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
