Re: [webkit-dev] libxml2 override encoding support

2011-01-05 Thread Patrick Gansterer

Alex Milowski:

 On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov a...@webkit.org wrote:
 
 On 04.01.2011, at 18:40, Alex Milowski wrote:
 
 Looking at the libxml2 API, I've been baffled myself about how to
 control the character encoding from the outside.  This looks like a
 serious lack of an essential feature.  Does anyone know about the above
 hack and can provide more detail?
 
 
 Here is some history: 
 http://mail.gnome.org/archives/xml/2006-February/msg00052.html, 
 https://bugzilla.gnome.org/show_bug.cgi?id=614333.
 
 Well, that is some interesting history.  *sigh*
 
 I take it the workaround is that data is read and decoded into an
 internal string represented as a sequence of UChar.  As such, we
 treat it as UTF-16-encoded data and feed that to the parser, forcing
 it to use UTF-16 every time.
 
 Too bad we can't just tell it the proper encoding--possibly the one
 from the transport--and let it do the decoding on the raw data.  Of
 course, that doesn't guarantee a better result.

Is there a reason why we can't pass the raw data to libxml2?
E.g., when the input file is UTF-8, we convert it into UTF-16 and then libxml2 
converts it back into UTF-8 (its internal format). This is a real performance 
problem when parsing XML [1].
Is there some (required) magic involved when detecting the encoding in WebKit? 
AFAIK XML always defaults to UTF-8 if there's no encoding declared.
Can we make libxml2 do the encoding detection and provide all of our decoders 
so it can use them?

[1] https://bugs.webkit.org/show_bug.cgi?id=43085
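
For reference, libxml2 does let you register additional decoders through its
encoding API, which is one shape the "provide all of our decoders" idea could
take. A minimal sketch, assuming a hypothetical decodeWithWebKitCodec() bridge
into one of our TextCodecs (not a real WebKit function); note libxml2 expects
input handlers to produce UTF-8:

#include <libxml/encoding.h>

// Hypothetical bridge to a WebKit decoder for one encoding: it must
// convert the raw bytes in 'in' to UTF-8 in 'out' and report via *inlen
// how much input was consumed.
extern int decodeWithWebKitCodec(unsigned char* out, int* outlen,
                                 const unsigned char* in, int* inlen);

static int shiftJISInput(unsigned char* out, int* outlen,
                         const unsigned char* in, int* inlen)
{
    return decodeWithWebKitCodec(out, outlen, in, inlen);
}

static void registerWebKitDecoder()
{
    // xmlNewCharEncodingHandler creates *and* registers the handler;
    // passing 0 for the output function makes it decode-only.
    xmlNewCharEncodingHandler("Shift_JIS", shiftJISInput, 0);
}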

- Patrick 



Re: [webkit-dev] libxml2 override encoding support

2011-01-05 Thread Alex Milowski
On Wed, Jan 5, 2011 at 5:07 AM, Patrick Gansterer par...@paroga.com wrote:

 Is there a reason why we can't pass the raw data to libxml2?
 E.g., when the input file is UTF-8, we convert it into UTF-16 and then libxml2 
 converts it back into UTF-8 (its internal format). This is a real performance 
 problem when parsing XML [1].
 Is there some (required) magic involved when detecting the encoding in 
 WebKit? AFAIK XML always defaults to UTF-8 if there's no encoding declared.
 Can we make libxml2 do the encoding detection and provide all of our decoders 
 so it can use them?

 [1] https://bugs.webkit.org/show_bug.cgi?id=43085


Looking at that bug, the XSLT argument is a red herring.  We don't
use libxml's data structures, and so when we use libxslt we either turn
the XML parser completely over to libxslt or we serialize and re-parse
(that's how the JavaScript-invoked XSLT works).  In both cases, we're
probably incurring a penalty for this double decoding of Unicode
encodings.

A native XML parser for WebKit would help in the situation where you
aren't using XSLT.  Only a native or different XSLT processor in
conjunction with a native XML parser would help in all cases.

The XSLT processor question is a thorny one that I brought up a while
ago.  I personally would love to see us use a processor that has
better integration with WebKit's API.  There are a handful of choices,
but many of them are XSLT 2.0 processors.

-- 
--Alex Milowski
The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered.

Bertrand Russell in a footnote of Principles of Mathematics


Re: [webkit-dev] libxml2 override encoding support

2011-01-05 Thread Alex Milowski
On Tue, Jan 4, 2011 at 7:14 PM, Eric Seidel e...@webkit.org wrote:
 You should feel encouraged to speak with dv (http://veillard.com/)
 more about this issue.

 Certainly I'd love to get rid of the hack, but I gave up after that
 email exchange.

In the shorter term, fixing this bug (or missing feature) in libxml2
would be ideal.  I need to understand the two different ways we invoke
XML parsing a bit better.  We bootstrap the libxml2 parser slightly
differently depending on whether it is a string parser or a memory
parser.  Why is there a difference?
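
For context, the two bootstrap paths correspond to two different libxml2
entry points. Roughly (generic libxml2 usage, not the exact WebKit code):

#include <libxml/parser.h>

static void parseAsStream(const char* firstChunk, int size)
{
    // Push parser: created around the first chunk and fed incrementally
    // with xmlParseChunk(), the natural fit for network data.
    xmlParserCtxtPtr ctxt = xmlCreatePushParserCtxt(0, 0, firstChunk, size, 0);
    // ... xmlParseChunk(ctxt, data, length, 0) for each later chunk ...
    xmlParseChunk(ctxt, 0, 0, 1); // terminate = 1 flushes and finishes
    xmlFreeParserCtxt(ctxt);
}

static void parseAsBuffer(const char* buffer, int size)
{
    // Memory parser: the whole document is available up front.
    xmlParserCtxtPtr ctxt = xmlCreateMemoryParserCtxt(buffer, size);
    xmlParseDocument(ctxt);
    xmlFreeParserCtxt(ctxt);
}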

-- 
--Alex Milowski
The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered.

Bertrand Russell in a footnote of Principles of Mathematics


Re: [webkit-dev] libxml2 override encoding support

2011-01-05 Thread Darin Adler
On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:

 Is there a reason why we can't pass the raw data to libxml2?

Because libxml2 does its own encoding detection which is not even close to 
what’s specified in HTML5, and supports far fewer encodings. If you make a test 
suite you will see.

On the other hand, you could probably make a path that lets libxml2 do the 
decoding for the most common encodings when specified in a way that we know 
libxml2 detects correctly, after doing some testing to see if it handles 
everything right.
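
A minimal sketch of that pass-through, assuming a hypothetical
isVerifiedEncoding() whitelist holding only names that testing has shown
libxml2 to decode exactly as WebKit's codecs do:

#include <libxml/parser.h>
#include <strings.h>

static bool isVerifiedEncoding(const char* name)
{
    // Hypothetical whitelist; entries would be added only after testing.
    return !strcasecmp(name, "UTF-8") || !strcasecmp(name, "ISO-8859-1");
}

// Returns true if libxml2 can be handed the raw bytes directly; false
// means WebKit must decode to UTF-16 itself (the switchToUTF16() hack).
static bool trySwitchToRawEncoding(xmlParserCtxtPtr ctxt, const char* name)
{
    if (!name || !isVerifiedEncoding(name))
        return false;
    xmlCharEncoding encoding = xmlParseCharEncoding(name);
    if (encoding == XML_CHAR_ENCODING_ERROR || encoding == XML_CHAR_ENCODING_NONE)
        return false;
    return !xmlSwitchEncoding(ctxt, encoding); // returns 0 on success
}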

-- Darin



Re: [webkit-dev] libxml2 override encoding support

2011-01-05 Thread Patrick Gansterer
Darin Adler:

 On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:
 
 Is there a reason why we can't pass the raw data to libxml2?
 
 Because libxml2 does its own encoding detection which is not even close to 
 what’s specified in HTML5, and supports far fewer encodings. If you make a 
 test suite you will see.

Can you point me to where the XML encoding rules are specified? After a short 
look into the spec I didn't find anything that applies to XML input encoding.
AFAIK it's possible to teach libxml2 additional encodings.
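
For what it's worth, the autodetection rules are in Appendix F of the XML 1.0
specification: sniff the byte order mark, then the raw bytes of a leading
"<?xml", and fall back to UTF-8 (the default when no encoding is declared).
A minimal sketch of that detection:

#include <libxml/encoding.h>

static xmlCharEncoding sniffXMLEncoding(const unsigned char* p, int length)
{
    // Byte order marks first.
    if (length >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return XML_CHAR_ENCODING_UTF8;
    if (length >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return XML_CHAR_ENCODING_UTF16BE;
    if (length >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return XML_CHAR_ENCODING_UTF16LE;
    // No BOM: look for "<?" encoded in 16-bit units.
    if (length >= 4 && !p[0] && p[1] == 0x3C && !p[2] && p[3] == 0x3F)
        return XML_CHAR_ENCODING_UTF16BE;
    if (length >= 4 && p[0] == 0x3C && !p[1] && p[2] == 0x3F && !p[3])
        return XML_CHAR_ENCODING_UTF16LE;
    // ASCII-compatible bytes: any encoding="" declaration decides;
    // with no declaration, XML defaults to UTF-8.
    return XML_CHAR_ENCODING_UTF8;
}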

 On the other hand, you could probably make a path that lets libxml2 do the 
 decoding for the most common encodings when specified in a way that we know 
 libxml2 detects correctly, after doing some testing to see if it handles 
 everything right.

That's something I'd like to do, but I'll need to find the time for it. ;-) My 
first step was to improve the performance of the libxml2 -> WebKit path.

- Patrick


Re: [webkit-dev] libxml2 override encoding support

2011-01-05 Thread Darin Adler
On Jan 5, 2011, at 8:38 AM, Patrick Gansterer wrote:

 Darin Adler:
 
 On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:
 
 Is there a reason why we can't pass the raw data to libxml2?
 
 Because libxml2 does its own encoding detection which is not even close to 
 what’s specified in HTML5, and supports far fewer encodings. If you make a 
 test suite you will see.
 
 Can you point me to where the XML encoding rules are specified? After a short 
 look into the spec I didn't find anything that applies to XML input encoding.
 AFAIK it's possible to teach libxml2 additional encodings.

I’m not sure where it is. I’ll let you know if I stumble on it later.

-- Darin



[webkit-dev] libxml2 override encoding support

2011-01-04 Thread Alex Milowski
I'm working through some rather thorny experiments with new XML
support within the browser and I ran into this snippet:

static void switchToUTF16(xmlParserCtxtPtr ctxt)
{
    // Hack around libxml2's lack of encoding override support by manually
    // resetting the encoding to UTF-16 before every chunk.  Otherwise libxml
    // will detect <?xml version="1.0" encoding="<encoding name>"?> blocks
    // and switch encodings, causing the parse to fail.
    const UChar BOM = 0xFEFF;
    const unsigned char BOMHighByte = *reinterpret_cast<const unsigned char*>(&BOM);
    xmlSwitchEncoding(ctxt, BOMHighByte == 0xFF ? XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);
}

Looking at the libxml2 API, I've been baffled myself about how to
control the character encoding from the outside.  This looks like a
serious lack of an essential feature.  Does anyone know about the above
hack and can provide more detail?

-- 
--Alex Milowski
The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered.

Bertrand Russell in a footnote of Principles of Mathematics


Re: [webkit-dev] libxml2 override encoding support

2011-01-04 Thread Alexey Proskuryakov

On 04.01.2011, at 18:40, Alex Milowski wrote:

 Looking at the libxml2 API, I've been baffled myself about how to
 control the character encoding from the outside.  This looks like a
 serious lack of an essential feature.  Does anyone know about the above
 hack and can provide more detail?


Here is some history: 
http://mail.gnome.org/archives/xml/2006-February/msg00052.html, 
https://bugzilla.gnome.org/show_bug.cgi?id=614333.

- WBR, Alexey Proskuryakov



Re: [webkit-dev] libxml2 override encoding support

2011-01-04 Thread Eric Seidel
You should feel encouraged to speak with dv (http://veillard.com/)
more about this issue.

Certainly I'd love to get rid of the hack, but I gave up after that
email exchange.

-eric

On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov a...@webkit.org wrote:

 On 04.01.2011, at 18:40, Alex Milowski wrote:

 Looking at the libxml2 API, I've been baffled myself about how to
 control the character encoding from the outside.  This looks like a
 serious lack of an essential feature.  Does anyone know about the above
 hack and can provide more detail?


 Here is some history: 
 http://mail.gnome.org/archives/xml/2006-February/msg00052.html, 
 https://bugzilla.gnome.org/show_bug.cgi?id=614333.

 - WBR, Alexey Proskuryakov




Re: [webkit-dev] libxml2 override encoding support

2011-01-04 Thread Alex Milowski
On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov a...@webkit.org wrote:

 On 04.01.2011, at 18:40, Alex Milowski wrote:

 Looking at the libxml2 API, I've been baffled myself about how to
 control the character encoding from the outside.  This looks like a
 serious lack of an essential feature.  Does anyone know about the above
 hack and can provide more detail?


 Here is some history: 
 http://mail.gnome.org/archives/xml/2006-February/msg00052.html, 
 https://bugzilla.gnome.org/show_bug.cgi?id=614333.

Well, that is some interesting history.  *sigh*

I take it the workaround is that data is read and decoded into an
internal string represented as a sequence of UChar.  As such, we
treat it as UTF-16-encoded data and feed that to the parser, forcing
it to use UTF-16 every time.

Too bad we can't just tell it the proper encoding--possibly the one
from the transport--and let it do the decoding on the raw data.  Of
course, that doesn't guarantee a better result.
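
Concretely, the feed step amounts to something like this (a sketch, not the
exact WebKit source), with switchToUTF16() as quoted in the original post:

// UChar is WebKit's 16-bit character type. The already-decoded string is
// handed to libxml2 as raw bytes, with the encoding re-pinned to UTF-16
// first so a conflicting <?xml ... encoding=...?> declaration can't
// switch it mid-parse.
static void feedDecodedChunk(xmlParserCtxtPtr ctxt, const UChar* characters, int length)
{
    switchToUTF16(ctxt);
    xmlParseChunk(ctxt, reinterpret_cast<const char*>(characters),
                  length * static_cast<int>(sizeof(UChar)), 0);
}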

-- 
--Alex Milowski
The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered.

Bertrand Russell in a footnote of Principles of Mathematics