Re: [webkit-dev] libxml2 "override encoding" support

2011-01-05 Thread Darin Adler
On Jan 5, 2011, at 8:38 AM, Patrick Gansterer wrote:

> Darin Adler:
> 
>> On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:
>> 
>>> Is there a reason why we can't pass the "raw" data to libxml2?
>> 
>> Because libxml2 does its own encoding detection which is not even close to 
>> what’s specified in HTML5, and supports far fewer encodings. If you make a 
>> test suite you will see.
> 
> Can you point me to where the XML encoding rules are specified? After a short look 
> into the spec I didn't find anything that applies to XML input encoding.
> AFAIK it's possible to teach libxml2 additional encodings.

I’m not sure where it is. I’ll let you know if I stumble on it later.

-- Darin

___
webkit-dev mailing list
webkit-dev@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev


Re: [webkit-dev] libxml2 "override encoding" support

2011-01-05 Thread Patrick Gansterer
Darin Adler:

> On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:
> 
>> Is there a reason why we can't pass the "raw" data to libxml2?
> 
> Because libxml2 does its own encoding detection which is not even close to 
> what’s specified in HTML5, and supports far fewer encodings. If you make a 
> test suite you will see.

Can you point me to where the XML encoding rules are specified? After a short look 
into the spec I didn't find anything that applies to XML input encoding.
AFAIK it's possible to teach libxml2 additional encodings.

> On the other hand, you could probably make a path that lets libxml2 do the 
> decoding for the most common encodings when specified in a way that we know 
> libxml2 detects correctly, after doing some testing to see if it handles 
> everything right.

That's something I'd like to do, but I'll need to find the time for it. ;-) My 
first step was to improve the performance of the libxml2 -> WebKit path.

- Patrick


Re: [webkit-dev] libxml2 "override encoding" support

2011-01-05 Thread Darin Adler
On Jan 5, 2011, at 5:07 AM, Patrick Gansterer wrote:

> Is there a reason why we can't pass the "raw" data to libxml2?

Because libxml2 does its own encoding detection which is not even close to 
what’s specified in HTML5, and supports far fewer encodings. If you make a test 
suite you will see.

On the other hand, you could probably make a path that lets libxml2 do the 
decoding for the most common encodings when specified in a way that we know 
libxml2 detects correctly, after doing some testing to see if it handles 
everything right.
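That whitelisting idea can be sketched as follows. This is a minimal illustration, not WebKit code: the function name, the policy, and the list of encodings are all hypothetical, standing in for "encodings we have verified libxml2 decodes correctly."

```cpp
#include <cstring>

// Hypothetical whitelist check sketching the suggestion above: hand raw
// bytes to libxml2 only when the declared encoding is one we have tested
// libxml2 against. The names and the list itself are illustrative only.
static bool canDelegateDecodingToLibxml2(const char* encodingName)
{
    static const char* const verifiedEncodings[] = {
        "UTF-8", "UTF-16LE", "UTF-16BE", "ISO-8859-1"
    };
    for (const char* verified : verifiedEncodings) {
        if (!std::strcmp(encodingName, verified))
            return true; // feed raw bytes straight to libxml2
    }
    return false; // fall back to WebKit's decoder + UTF-16 push path
}
```

Anything not on the list would keep going through the existing decode-to-UTF-16 path, so behavior only changes where libxml2 has been shown to handle the encoding correctly.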

-- Darin



Re: [webkit-dev] libxml2 "override encoding" support

2011-01-05 Thread Alex Milowski
On Tue, Jan 4, 2011 at 7:14 PM, Eric Seidel  wrote:
> You should feel encouraged to speak with dv (http://veillard.com/)
> more about this issue.
>
> Certainly I'd love to get rid of the hack, but I gave up after that
> email exchange.

In the shorter term, fixing this "bug" or "lack of feature" in libxml2
would be ideal.  I need to understand the two different ways we invoke
XML parsing a bit better.  We bootstrap the libxml2 parser slightly
differently depending on whether it is a "string parser" or a "memory
parser".   Why is there a difference?

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics


Re: [webkit-dev] libxml2 "override encoding" support

2011-01-05 Thread Alex Milowski
On Wed, Jan 5, 2011 at 5:07 AM, Patrick Gansterer  wrote:
>
> Is there a reason why we can't pass the "raw" data to libxml2?
> E.g. when the input file is UTF-8 we convert it into UTF-16 and then libxml2 
> converts it back into UTF-8 (its internal format). This is a real performance 
> problem when parsing XML [1].
> Is there some (required) magic involved when detecting the encoding in 
> WebKit? AFAIK XML always defaults to UTF-8 if there's no encoding declared.
> Can we make libxml2 do the encoding detection and provide all of our decoders 
> so it can use it?
>
> [1] https://bugs.webkit.org/show_bug.cgi?id=43085
>

Looking at that bug, the "XSLT argument" is a red herring.  We don't
use libxml2's data structures, so when we use libxslt we either turn
the XML parsing completely over to libxslt or we serialize and re-parse
(that's how JavaScript-invoked XSLT works).  In both cases, we're
probably incurring a penalty for this double decoding between Unicode
encodings.

A native XML parser for WebKit would help in the situation where you
aren't using XSLT.  Only a native or different XSLT processor in
conjunction with a native XML parser would help in all cases.

The XSLT processor question is a thorny one that I brought up a while
ago.  I personally would love to see us use a processor that has
better integration with WebKit's API.  There are a handful of choices,
but many of them are XSLT 2.0 processors.



Re: [webkit-dev] libxml2 "override encoding" support

2011-01-05 Thread Patrick Gansterer

Alex Milowski:

> On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov  wrote:
>> 
>> On 04.01.2011, at 18:40, Alex Milowski wrote:
>> 
>>> Looking at the libxml2 API, I've been baffled myself about how to
>>> control the character encoding from the outside.  This looks like a
>>> serious lack of an essential feature.  Anyone know about this above
>>> "hack" and can provide more detail?
>> 
>> 
>> Here is some history: 
>> , 
>> .
> 
> Well, that is some interesting history.  *sigh*
> 
> I take it the "workaround" is that data is read and decoded into an
> internal string represented as a sequence of UChar.  As such, we treat
> it as UTF-16-encoded data and feed it to the parser, forcing libxml2
> to use UTF-16 every time.
> 
> Too bad we can't just tell it the proper encoding--possibly the one
> from the transport--and let it do the decoding on the raw data.  Of
> course, that doesn't guarantee a better result.

Is there a reason why we can't pass the "raw" data to libxml2?
E.g. when the input file is UTF-8 we convert it into UTF-16 and then libxml2 
converts it back into UTF-8 (its internal format). This is a real performance 
problem when parsing XML [1].
Is there some (required) magic involved when detecting the encoding in WebKit? 
AFAIK XML always defaults to UTF-8 if there's no encoding declared.
Can we make libxml2 do the encoding detection and provide all of our decoders 
so it can use it?

[1] https://bugs.webkit.org/show_bug.cgi?id=43085
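The round trip being described can be sketched like this. It is an illustration only (restricted to the ASCII subset so the conversions stay trivial); the point is that every byte is touched twice before libxml2 ever sees the document in its internal UTF-8 form.

```cpp
#include <string>

// Illustration of the double conversion: WebKit decodes the raw UTF-8
// bytes to UTF-16 (UChar), then libxml2 re-encodes them back to UTF-8
// internally -- two full passes over the data. ASCII subset only.
std::u16string decodeToUTF16(const std::string& utf8)
{
    std::u16string out;
    for (unsigned char c : utf8)
        out.push_back(c); // pass 1: widen every byte to 16 bits
    return out;
}

std::string reencodeToUTF8(const std::u16string& utf16)
{
    std::string out;
    for (char16_t c : utf16)
        out.push_back(static_cast<char>(c)); // pass 2: narrow it back
    return out;
}
```

For a UTF-8 input, `reencodeToUTF8(decodeToUTF16(bytes))` reproduces the original bytes exactly, which is what makes the intermediate UTF-16 buffer pure overhead in that case.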

- Patrick 



Re: [webkit-dev] libxml2 "override encoding" support

2011-01-04 Thread Alex Milowski
On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov  wrote:
>
> On 04.01.2011, at 18:40, Alex Milowski wrote:
>
>> Looking at the libxml2 API, I've been baffled myself about how to
>> control the character encoding from the outside.  This looks like a
>> serious lack of an essential feature.  Anyone know about this above
>> "hack" and can provide more detail?
>
>
> Here is some history: 
> , 
> .

Well, that is some interesting history.  *sigh*

I take it the "workaround" is that data is read and decoded into an
internal string represented as a sequence of UChar.  As such, we treat
it as UTF-16-encoded data and feed it to the parser, forcing libxml2
to use UTF-16 every time.

Too bad we can't just tell it the proper encoding--possibly the one
from the transport--and let it do the decoding on the raw data.  Of
course, that doesn't guarantee a better result.



Re: [webkit-dev] libxml2 "override encoding" support

2011-01-04 Thread Eric Seidel
You should feel encouraged to speak with dv (http://veillard.com/)
more about this issue.

Certainly I'd love to get rid of the hack, but I gave up after that
email exchange.

-eric

On Tue, Jan 4, 2011 at 7:05 PM, Alexey Proskuryakov  wrote:
>
> On 04.01.2011, at 18:40, Alex Milowski wrote:
>
>> Looking at the libxml2 API, I've been baffled myself about how to
>> control the character encoding from the outside.  This looks like a
>> serious lack of an essential feature.  Anyone know about this above
>> "hack" and can provide more detail?
>
>
> Here is some history: 
> , 
> .
>
> - WBR, Alexey Proskuryakov
>


Re: [webkit-dev] libxml2 "override encoding" support

2011-01-04 Thread Alexey Proskuryakov

On 04.01.2011, at 18:40, Alex Milowski wrote:

> Looking at the libxml2 API, I've been baffled myself about how to
> control the character encoding from the outside.  This looks like a
> serious lack of an essential feature.  Anyone know about this above
> "hack" and can provide more detail?


Here is some history: 
, 
.

- WBR, Alexey Proskuryakov



[webkit-dev] libxml2 "override encoding" support

2011-01-04 Thread Alex Milowski
I'm working through some rather thorny experiments with new XML
support within the browser and I ran into this snippet:

static void switchToUTF16(xmlParserCtxtPtr ctxt)
{
    // Hack around libxml2's lack of encoding override support by manually
    // resetting the encoding to UTF-16 before every chunk.  Otherwise libxml
    // will detect <?xml version="1.0" encoding="<encoding name>"?> blocks
    // and switch encodings, causing the parse to fail.
    const UChar BOM = 0xFEFF;
    const unsigned char BOMHighByte = *reinterpret_cast<const unsigned char*>(&BOM);
    xmlSwitchEncoding(ctxt, BOMHighByte == 0xFF ?
        XML_CHAR_ENCODING_UTF16LE : XML_CHAR_ENCODING_UTF16BE);
}

Looking at the libxml2 API, I've been baffled myself about how to
control the character encoding from the outside.  This looks like a
serious lack of an essential feature.  Does anyone know about the above
"hack" and can provide more detail?
