Re: [webkit-dev] libxml2 override encoding support
Alex Milowski:
> On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov a...@webkit.org wrote:
>> On 04.01.2011, at 18:40, Alex Milowski wrote:
>>> Looking at the libxml2 API, I've been baffled myself about how to control the character encoding from the outside. This looks like a serious lack of an essential feature. Anyone know about this above hack and can provide more detail?
>>
>> Here is some history: http://mail.gnome.org/archives/xml/2006-February/msg00052.html, https://bugzilla.gnome.org/show_bug.cgi?id=614333.
>
> Well, that is some interesting history. *sigh*
>
> I take it the workaround is that data is read and decoded into an internal string, which is represented by a sequence of UChar. As such, we treat it as UTF-16 encoded data and feed that to the parser, forcing it to use UTF-16 every time.
>
> Too bad we can't just tell it the proper encoding--possibly the one from the transport--and let it do the decoding on the raw data. Of course, that doesn't guarantee a better result.

Is there a reason why we can't pass the raw data to libxml2? E.g. when the input file is UTF-8, we convert it into UTF-16 and then libxml2 converts it back into UTF-8 (its internal format). This is a real performance problem when parsing XML [1].

Is there some (required) magic involved when detecting the encoding in WebKit? AFAIK XML always defaults to UTF-8 if there's no encoding declared. Can we make libxml2 do the encoding detection and provide all of our decoders so it can use them?

[1] https://bugs.webkit.org/show_bug.cgi?id=43085

- Patrick
___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
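To make the "defaults to UTF-8" point in the message above concrete, here is a minimal sketch of the simplest part of XML encoding detection: looking at a byte order mark and falling back to UTF-8 when there is none. The function name `sniffEncodingFromBom` is hypothetical and is not part of WebKit or libxml2. The sketch deliberately ignores the `encoding` pseudo-attribute of the XML declaration, any transport-level charset, and UTF-32 byte order marks, all of which a real detector would have to consider.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not a WebKit or libxml2 function.
// Picks an encoding name from the byte order mark alone and falls back to
// UTF-8 when no BOM is present. UTF-32 BOMs and the XML declaration are
// intentionally not examined.
inline std::string sniffEncodingFromBom(const unsigned char* data, std::size_t length)
{
    assert(data || length == 0);

    // EF BB BF is the UTF-8 encoded form of U+FEFF.
    if (length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    // FF FE is U+FEFF stored low-order byte first.
    if (length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    // FE FF is U+FEFF stored high-order byte first.
    if (length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    // No BOM: assume UTF-8 until a declaration says otherwise.
    return "UTF-8";
}
```

A caller would pass the first few bytes of the raw resource. Whether the result agrees with what libxml2 or WebKit's own decoder would choose for a given document is exactly the open question in this thread.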
Re: [webkit-dev] libxml2 override encoding support
On Wed, Jan 5, 2011 at 5:07 AM, Patrick Gansterer par...@paroga.com wrote:
> Is there a reason why we can't pass the raw data to libxml2? E.g. when the input file is UTF-8, we convert it into UTF-16 and then libxml2 converts it back into UTF-8 (its internal format). This is a real performance problem when parsing XML [1].
>
> Is there some (required) magic involved when detecting the encoding in WebKit? AFAIK XML always defaults to UTF-8 if there's no encoding declared. Can we make libxml2 do the encoding detection and provide all of our decoders so it can use them?
>
> [1] https://bugs.webkit.org/show_bug.cgi?id=43085

Looking at that bug, the XSLT argument is a red herring. We don't use libxml's data structures, so when we use libxslt we either turn the XML parser completely over to libxslt or we serialize and re-parse (that's how the JavaScript-invoked XSLT works). In both cases, we're probably incurring a penalty for this double decoding of Unicode encodings.

A native XML parser for WebKit would help in the situation where you aren't using XSLT. Only a native or different XSLT processor in conjunction with a native XML parser would help in all cases.

The XSLT processor question is a thorny one that I brought up a while ago. I personally would love to see us use a processor that has better integration with WebKit's API. There are a handful of choices, but many of them are XSLT 2.0.

--
--Alex Milowski
The excellence of grammar as a guide is proportional to the paucity of the inflexions, i.e. to the degree of analysis effected by the language considered.
Bertrand Russell in a footnote of Principles of Mathematics
Re: [webkit-dev] libxml2 override encoding support
On Tue, Jan 4, 2011 at 7:14 PM, Eric Seidel e...@webkit.org wrote:
> You should feel encouraged to speak with dv (http://veillard.com/) more about this issue. Certainly I'd love to get rid of the hack, but I gave up after that email exchange.

In the shorter term, fixing this bug or lack of feature in libxml2 would be ideal.

I need to understand the two different ways we invoke XML parsing a bit better. We bootstrap the libxml2 parser slightly differently depending on whether it is a string parser or a memory parser. Why is there a difference?

--
--Alex Milowski
Re: [webkit-dev] libxml2 override encoding support
On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:
> Is there a reason why we can't pass the raw data to libxml2?

Because libxml2 does its own encoding detection, which is not even close to what’s specified in HTML5, and supports far fewer encodings. If you make a test suite you will see.

On the other hand, you could probably make a path that lets libxml2 do the decoding for the most common encodings when specified in a way that we know libxml2 detects correctly, after doing some testing to see if it handles everything right.

-- Darin
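The fast path suggested above could be gated by a small allowlist check, sketched here under loud assumptions: the function name `parserMayDecode` is hypothetical, and the two labels in the list are placeholders, not encodings that anyone in this thread has verified libxml2 to detect correctly. The real list would come out of the testing Darin describes.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical gate for a "let the parser decode" fast path; not WebKit API.
// Returns true only for encoding labels on an allowlist. The entries below
// are placeholders: a real list would contain only labels that testing has
// shown the parser to handle the same way as the browser's own decoder.
inline bool parserMayDecode(const char* label)
{
    assert(label != nullptr);

    static const char* const allowed[] = { "utf-8", "us-ascii" };

    // Encoding labels are compared case-insensitively (ASCII only).
    std::string lower(label);
    std::transform(lower.begin(), lower.end(), lower.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    for (const char* name : allowed) {
        if (lower == name)
            return true;
    }
    return false;
}
```

Anything that fails the check would keep going through the existing path: decode in WebKit, then hand UTF-16 to the parser. That keeps the behavior unchanged for every encoding that has not been explicitly tested.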
Re: [webkit-dev] libxml2 override encoding support
Darin Adler:
> On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:
>> Is there a reason why we can't pass the raw data to libxml2?
>
> Because libxml2 does its own encoding detection, which is not even close to what’s specified in HTML5, and supports far fewer encodings. If you make a test suite you will see.

Can you point me to where the XML encoding rules are specified? After a short look into the spec I didn't find anything that applies to XML input encoding. AFAIK it's possible to teach libxml2 additional encodings.

> On the other hand, you could probably make a path that lets libxml2 do the decoding for the most common encodings when specified in a way that we know libxml2 detects correctly, after doing some testing to see if it handles everything right.

That's something I'd like to do, but I need to find the time for it. ;-) My first step was to improve the performance of libxml2 - WebKit.

- Patrick
Re: [webkit-dev] libxml2 override encoding support
On Jan 5, 2011, at 8:38 AM, Patrick Gansterer wrote:
> Darin Adler:
>> On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:
>>> Is there a reason why we can't pass the raw data to libxml2?
>>
>> Because libxml2 does its own encoding detection, which is not even close to what’s specified in HTML5, and supports far fewer encodings. If you make a test suite you will see.
>
> Can you point me to where the XML encoding rules are specified? After a short look into the spec I didn't find anything that applies to XML input encoding. AFAIK it's possible to teach libxml2 additional encodings.

I’m not sure where it is. I’ll let you know if I stumble on it later.

-- Darin
[webkit-dev] libxml2 override encoding support
I'm working through some rather thorny experiments with new XML support within the browser and I ran into this snippet:

    static void switchToUTF16(xmlParserCtxtPtr ctxt)
    {
        // Hack around libxml2's lack of encoding override support by manually
        // resetting the encoding to UTF-16 before every chunk. Otherwise libxml
        // will detect <?xml version="1.0" encoding="<encoding name>"?> blocks
        // and switch encodings, causing the parse to fail.
        const UChar BOM = 0xFEFF;
        const unsigned char BOMHighByte = *reinterpret_cast<const unsigned char*>(&BOM);
        xmlSwitchEncoding(ctxt, BOMHighByte == 0xFF ? XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);
    }

Looking at the libxml2 API, I've been baffled myself about how to control the character encoding from the outside. This looks like a serious lack of an essential feature. Anyone know about this above hack and can provide more detail?

--
--Alex Milowski
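The byte-order test inside that snippet can be looked at in isolation. The sketch below is only an illustration: `Utf16ByteOrder` and `nativeUtf16ByteOrder` are hypothetical names, `char16_t` stands in for `UChar`, and no libxml2 call is made. It stores U+FEFF in a 16-bit code unit and inspects the byte at the lowest address, which is what the variable the snippet calls `BOMHighByte` actually holds.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical names for illustration; not WebKit or libxml2 API.
enum class Utf16ByteOrder { LittleEndian, BigEndian };

// Stores U+FEFF in one 16-bit code unit and reads back the first byte in
// memory. 0xFF first means the low-order byte is stored first
// (little-endian); 0xFE first means big-endian.
inline Utf16ByteOrder nativeUtf16ByteOrder()
{
    static_assert(sizeof(char16_t) == 2, "sketch assumes 16-bit code units");

    const char16_t bom = 0xFEFF;
    unsigned char firstByte = 0;
    std::memcpy(&firstByte, &bom, 1);

    // Only these two values are possible for a two-byte 0xFEFF.
    assert(firstByte == 0xFF || firstByte == 0xFE);

    return firstByte == 0xFF ? Utf16ByteOrder::LittleEndian
                             : Utf16ByteOrder::BigEndian;
}
```

In the snippet, the result of this test selects between `XML_CHAR_ENCODING_UTF16LE` and `XML_CHAR_ENCODING_UTF16BE`, so the parser is told to read the in-memory UTF-16 buffer in the machine's own byte order.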
Re: [webkit-dev] libxml2 override encoding support
On 04.01.2011, at 18:40, Alex Milowski wrote:
> Looking at the libxml2 API, I've been baffled myself about how to control the character encoding from the outside. This looks like a serious lack of an essential feature. Anyone know about this above hack and can provide more detail?

Here is some history: http://mail.gnome.org/archives/xml/2006-February/msg00052.html, https://bugzilla.gnome.org/show_bug.cgi?id=614333.

- WBR, Alexey Proskuryakov
Re: [webkit-dev] libxml2 override encoding support
You should feel encouraged to speak with dv (http://veillard.com/) more about this issue. Certainly I'd love to get rid of the hack, but I gave up after that email exchange.

-eric

On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov a...@webkit.org wrote:
> On 04.01.2011, at 18:40, Alex Milowski wrote:
>> Looking at the libxml2 API, I've been baffled myself about how to control the character encoding from the outside. This looks like a serious lack of an essential feature. Anyone know about this above hack and can provide more detail?
>
> Here is some history: http://mail.gnome.org/archives/xml/2006-February/msg00052.html, https://bugzilla.gnome.org/show_bug.cgi?id=614333.
>
> - WBR, Alexey Proskuryakov
Re: [webkit-dev] libxml2 override encoding support
On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov a...@webkit.org wrote:
> On 04.01.2011, at 18:40, Alex Milowski wrote:
>> Looking at the libxml2 API, I've been baffled myself about how to control the character encoding from the outside. This looks like a serious lack of an essential feature. Anyone know about this above hack and can provide more detail?
>
> Here is some history: http://mail.gnome.org/archives/xml/2006-February/msg00052.html, https://bugzilla.gnome.org/show_bug.cgi?id=614333.

Well, that is some interesting history. *sigh*

I take it the workaround is that data is read and decoded into an internal string, which is represented by a sequence of UChar. As such, we treat it as UTF-16 encoded data and feed that to the parser, forcing it to use UTF-16 every time.

Too bad we can't just tell it the proper encoding--possibly the one from the transport--and let it do the decoding on the raw data. Of course, that doesn't guarantee a better result.

--
--Alex Milowski