Hi, I got this forwarded as a wishlist bug for libxml2, but that doesn't sound right to me. I always thought control characters are not allowed in XML, though looking in the XML spec, I can't find anything definitive...

Daniel, what do you think?

Mike

PS: You can see the whole thread on
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=500015

On Wed, Sep 24, 2008 at 07:30:39PM -0700, Matt Kraai wrote:
> On Wed, Sep 24, 2008 at 10:12:41AM -0700, Rodrigo Gallardo wrote:
> > > The feed at
> > >
> > >  http://jc.ngo.org.uk/~nik/use.perl.journals.rss
> > >
> > > currently contains a SOH character (i.e., the 0x01 character).  When I
> > > click on it in Liferea, it displays the following error message:
> > >
> > >  XML Parsing Error: reference to invalid character number
> > >  Location: file:///
> > >  Line Number 20, Column 45:
> > >
> > >  <pre>Aha. On the line 580 of that we have a  character. Which leads
> > >  me to
> > >  --------------------------------------------^
> > >
> > > The feed has a UTF-8 encoding declaration and the SOH character is a
> > > valid Unicode character, so I think this error is in error.
> >
> > As a matter of fact, the XML spec says
> > (http://www.w3.org/TR/REC-xml/#dt-character)
> > that
> >
> > Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> >          [#x10000-#x10FFFF]
> >
> > so  is not a valid char for an XML document.
>
> I don't think this is a correct inference.  In
> http://www.w3.org/TR/REC-xml/#charsets, it says
>
>  Consequently, XML processors MUST accept any character in the range
>  specified for Char. ]
>
>  Character Range
>
>  [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |  /* any Unicode character,
>               [#xE000-#xFFFD] |                     excluding the surrogate
>               [#x10000-#x10FFFF]                    blocks, FFFE, and FFFF. */
>
> but it doesn't specify that it must accept *only* characters in that
> range.  In fact, the next paragraph states
>
>  All XML processors MUST accept the UTF-8 and UTF-16 encodings of
>  Unicode 3.1 ...
>
> In http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt, the
> list of Unicode 3.1 characters, the SOH character is the second entry.
>
> --
> Matt                                                 http://ftbfs.org/

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
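
For anyone who wants to poke at the question, the Char production quoted in the thread can be checked mechanically. The following is only a rough sketch in Python, not anything taken from libxml2: `XML_CHAR_RANGES`, `is_xml_char` and `parses_ok` are hypothetical names made up for illustration, the ranges are copied from the production as quoted above, and the parser used is the one bundled with Python's standard library rather than libxml2, so its behaviour is only indicative.

```python
import xml.etree.ElementTree as ET

# Ranges copied from the Char production quoted in the thread (XML 1.0).
XML_CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(code_point: int) -> bool:
    """Return True if code_point falls in one of the Char ranges above."""
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


def parses_ok(document: str) -> bool:
    """Return True if Python's bundled XML parser accepts the document.

    This is NOT libxml2; it only shows how one other parser reacts.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # SOH (0x01) is the character from the bug report; TAB (0x09) is in Char.
    print("SOH in Char:", is_xml_char(0x01))
    print("TAB in Char:", is_xml_char(0x09))
    # Character references to the same two code points.
    print("&#1; accepted:", parses_ok("<a>&#1;</a>"))
    print("&#9; accepted:", parses_ok("<a>&#9;</a>"))
```

The range check only answers whether a code point is inside Char. It does not settle the point Matt raises, namely whether a processor may also accept characters outside that range.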
