On 24/10/10 16:01, Gustavo André dos Santos Lopes wrote:
cataphract Sun, 24 Oct 2010 15:01:02 +0000
Revision: http://svn.php.net/viewvc?view=revision&revision=304705
Log:
> […]
- For html_entity_decode(), only valid numerical and named entities (as defined
above for htmlentities()/htmlspecialchars() + !double_encode) are decoded.
But there is in this case one additional check. Entities that represent
non-SGML or otherwise invalid characters are not decoded. Note that, in
HTML5, U+000D is a valid literal character, but the entity &#x0D; is not
valid and is therefore not decoded.
This shouldn't be the behaviour. &#x0D; is invalid for document
conformance, but what is conforming doesn't matter here. What matters
here is the parser conformance requirements, which for tokenizing
character references is
<http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references>.
This makes it quite clear that &#x0D; decodes to U+000D.
By not decoding &#x0D;, it's not a conforming HTML5 implementation, as
there are two allowed behaviours when reaching a parse error such as
&#x0D;: stop processing entirely (effectively dropping the rest of the
string) or follow the spec's behaviour for that parse error.
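For comparison only (this is not PHP's implementation), here is a minimal sketch of the parser-side behaviour using Python's html.unescape(), which its documentation describes as following the HTML 5 rules for character references. The helper name decode_reference is hypothetical and exists purely for illustration.

```python
import html


def decode_reference(text: str) -> str:
    """Decode character references using Python's HTML5-style html.unescape()."""
    return html.unescape(text)


# Decimal and hexadecimal spellings of the carriage-return reference.
decimal = decode_reference("a&#13;b")
hexadecimal = decode_reference("a&#x0D;b")

# Both results contain a literal U+000D between "a" and "b".
print(repr(decimal), repr(hexadecimal))
```

The reference is still a parse error as far as the spec is concerned, but the decoder follows the recovery behaviour instead of leaving the text untouched.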
Also, do you decode "&amp" without a trailing semi-colon? That's equally
required to be decoded by HTML5.
I presume the implementation decodes the "&amp;" in '<xmp>&amp;</xmp>',
which equally makes it non-conforming from an HTML5 POV…
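On the missing semi-colon point, a small sketch, again using Python's html.unescape() as a stand-in for an HTML5-style decoder rather than anything in PHP. The sample strings are made up for illustration.

```python
import html

# "&amp" with no trailing semicolon is still decoded to "&".
no_semicolon = html.unescape("Tom &amp Jerry")

# The properly terminated form decodes to the same text.
with_semicolon = html.unescape("Tom &amp; Jerry")

print(no_semicolon)
print(with_semicolon)
```

This only covers decoding a bare string; it says nothing about raw-text elements such as xmp, where a full parser would have to know which element it is in.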
Bug: http://bugs.php.net/52860 (Open) htmlspecialchars/htmlentities stripping
invalid characters
Ah, this shows the mistake: you follow section 8.1 (writing HTML
documents) for character references instead of 8.2 (parsing HTML
documents), whereas when you're decoding entities you're parsing them…
HTH,
Geoffrey.
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php