On Wed, Sep 14, 2011 at 10:19:20AM +0200, Murray Cumming wrote:
> On Wed, 2011-09-14 at 16:10 +0800, Daniel Veillard wrote:
> > On Fri, Sep 09, 2011 at 04:30:45PM +0200, Murray Cumming wrote:
> > > On Fri, 2011-09-09 at 10:21 -0400, Jason Viers wrote:
> > > > On 9/9/2011 05:37, Murray Cumming wrote:
> > > > > Here is a simple test case that takes the text from an
> > > > > apparently-valid UTF-8 file
> > > >
> > > > Not all valid UTF-8 is valid in XML. Only a subset, as defined in
> > > > http://www.w3.org/TR/2008/REC-xml-20081126/#charsets
> > > >
> > > > Note that Form Feed (0xC) is not allowed. Your original input document
> > > > contains a formfeed character, and this is what ends up being invalid.
> > > > It's not a matter of escaping; form feed as a literal byte, numeric
> > > > reference, etc., is not allowed.
> > > > Stripping the form feed from the input allows it to serialize properly.
> > >
> > > Ah, I didn't know that it couldn't be there even if escaped. Thanks.
> > >
> > > Shouldn't libxml warn about that at the same time that it would escape
> > > characters such as & and < rather than writing invalid XML?
> >
> > It's a choice, either you make all APIs validate all input strings
> > or you rely on the client to do it. In libxml2 I took the second path
> > and that was decided 10+ years ago. The parser on the other hand is
> > strict but that's mandatory to follow the spec.
>
> OK. Thanks. Is that documented?

Yes and no. You used xmlNewText
  http://xmlsoft.org/html/libxml-tree.html#xmlNewText
which takes an xmlChar *, and you cast your string to that type.
At the end of http://xmlsoft.org/FAQ.html#Developer the FAQ states:

---------------------------------------
# So what is this funky "xmlChar" used all the time?

It is a null terminated sequence of utf-8 characters. And only utf-8! You
need to convert strings encoded in different ways to utf-8 before passing
them to the API. This can be accomplished with the iconv library for
instance.
---------------------------------------

Usually the problem we see is a different encoding being passed in, rather
than an error due to characters that are in Unicode but not accepted by XML
(there are not that many). Maybe that should be made clearer.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
dan...@veillard.com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
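
Since the tree API leaves this check to the client, here is a minimal sketch
of what such a check could look like, assuming the Char ranges from the XML
1.0 production referenced earlier in the thread. The names is_xml_char and
first_invalid_xml_char are hypothetical helpers written for this
illustration; they are not libxml2 functions, and the sketch deliberately
uses no libxml2 calls so that it stands on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XML 1.0 Char production:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 * Hypothetical helper, not part of libxml2. */
int is_xml_char(uint32_t c)
{
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

/* Hypothetical helper, not part of libxml2.
 * Scans a NUL-terminated byte string and returns the byte offset of the
 * first sequence that is either malformed UTF-8 or decodes to a code point
 * outside the XML Char production. Returns -1 if every character is fine. */
long first_invalid_xml_char(const unsigned char *s)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
     * anything below that is an overlong encoding. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;

    assert(s != NULL);

    while (s[i] != 0) {
        unsigned char b = s[i];
        uint32_t c;
        size_t len, k;

        if (b < 0x80)                { c = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { c = b & 0x07; len = 4; }
        else return (long)i;  /* stray continuation byte or invalid lead */

        for (k = 1; k < len; k++) {
            unsigned char cb = s[i + k];
            /* Also catches a truncated sequence: the NUL terminator is
             * not a continuation byte, so we never read past it. */
            if ((cb & 0xC0) != 0x80)
                return (long)i;
            c = (c << 6) | (cb & 0x3F);
        }

        if (c < min_for_len[len])  /* overlong encoding */
            return (long)i;
        if (!is_xml_char(c))       /* e.g. form feed, surrogates, U+FFFE */
            return (long)i;

        i += len;
    }
    return -1;
}
```

A caller could run first_invalid_xml_char() on a string before casting it to
xmlChar * and handing it to xmlNewText, and then strip or replace the
offending character, as was done with the form feed above. Whether to reject
the string, drop the character or substitute something else is a policy
decision for the application.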