On 24/10/10 16:01, Gustavo André dos Santos Lopes wrote:
cataphract                               Sun, 24 Oct 2010 15:01:02 +0000

Revision: http://svn.php.net/viewvc?view=revision&revision=304705

Log:
> […]
- For html_entity_decode(), only valid numerical and named entities (as defined
   above for htmlentities()/htmlspecialchars() + !double_encode) are decoded.
   But there is in this case one additional check. Entities that represent
   non-SGML or otherwise invalid characters are not decoded. Note that, in
   HTML5, U+000D is a valid literal character, but the entity &#x0D; is not
   valid and is therefore not decoded.

This shouldn't be the behaviour. &#x0D; is invalid for document conformance, but what is conforming doesn't matter here. What matters is the parser conformance requirements, which for tokenizing character references are at <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references>. These make it quite clear that &#x0D; decodes to U+000D.
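
For comparison only (this is Python, not PHP): Python's standard-library html.unescape() is documented as using the HTML 5 rules for both valid and invalid character references, so it's a handy reference point for the parser-side behaviour. A minimal sketch, assuming its result reflects the tokenizer rules linked above; the sample string is made up:

```python
import html

# Numeric character reference for carriage return, written in full.
decoded = html.unescape("a&#x0D;b")

# The reference is replaced by the literal U+000D character
# rather than being left in the string undecoded.
print(repr(decoded))
```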

By not decoding &#x0D;, it's not a conforming HTML5 implementation: there are only two allowed behaviours on reaching a parse error such as &#x0D;. Either stop processing entirely (effectively dropping the rest of the string) or follow the spec's behaviour for that parse error.

Also, do you decode "&amp" without a trailing semi-colon? That's equally required to be decoded by HTML5.
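
To show what I mean by the missing semicolon, here is the same kind of comparison using Python's html.unescape() (again not PHP, just a stand-in that is documented to follow the HTML 5 rules; the sample strings are made up):

```python
import html

# Named reference with and without the trailing semicolon.
with_semicolon = html.unescape("fish &amp; chips")
without_semicolon = html.unescape("fish &amp chips")

# Both forms decode to a literal ampersand.
print(with_semicolon)
print(without_semicolon)
```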

I presume the implementation decodes the "&amp;" in '<xmp>&amp;</xmp>', which equally makes it non-conforming from an HTML5 POV…

Bug: http://bugs.php.net/52860 (Open) htmlspecialchars/htmlentities stripping 
invalid characters

Ah, this shows the mistake: you're following section 8.1 (writing HTML documents) for character references instead of 8.2 (parsing HTML documents), but when you're decoding entities you're parsing…

HTH,

Geoffrey.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
