Re: [webkit-dev] Writing a new XML parser with no external libraries

Eric Seidel Tue, 28 Jun 2011 18:51:07 -0700

Correct.  We convert from UTF-16 to UTF-8 (for libxml2) and then back to UTF-16.
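
To make the cost concrete, here is a minimal sketch of that round trip. It is
not WebCore's or libxml2's actual code: the function names utf16ToUtf8 and
utf8ToUtf16 are made up for illustration, and the sketch assumes well-formed
input (paired surrogates, valid UTF-8) with no error handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode UTF-16 as UTF-8 (assumes well-formed input).
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            char32_t low = in[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-8 back to UTF-16 (assumes valid UTF-8).
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        std::size_t extra;
        if (lead < 0x80) { cp = lead; extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else { cp = lead & 0x07; extra = 3; }
        // Fold in the continuation bytes, six payload bits each.
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Split code points above the BMP into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Every code unit is touched and re-encoded in both directions, so a parser that
worked on UTF-16 directly could skip both passes.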


There has been at least one libxml-related security fix to WebCore in
recent memory.

We have various hacks in the libxml2 parser due to libxml2 being
designed to be a library used by applications, and not used by a
library like WebKit:
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L373
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L488
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1093
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1182
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1273


I'm in general in favor of this effort (having worked extensively on
the existing XML parsers).

But I would caution you that XML is a ridiculously tiny fraction of
the web.  And it may not be worth the engineering effort to make a
better parser.

http://www.google.com/search?q=filetype:html = 25,270,000,000
http://www.google.com/search?q=filetype:xml = 71,000,000

(Naively) judging by those numbers we should be spending 356 times as
much effort on our HTML support as on our XML support. :)

-eric
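
P.S. For anyone skimming the quoted proposal below, the flag arrangement it
describes (the existing USE(QXMLSTREAM) check, an explicit USE(LIBXML2) check
in place of the bare #else, and a tentatively named ENABLE(NEW_XML)) might look
roughly like this. It is only a sketch: the USE/ENABLE macros here are
simplified stand-ins defined locally, not the real ones from WTF, and
xmlParserBackend is a made-up function.

```cpp
#include <string>

// Simplified, hypothetical stand-ins for WebKit's USE()/ENABLE() macros.
// An undefined WTF_USE_* or ENABLE_* identifier evaluates to 0 inside #if.
#define USE(FEATURE) (WTF_USE_##FEATURE)
#define ENABLE(FEATURE) (ENABLE_##FEATURE)

// Build configuration assumed for this sketch: only the new parser is on.
#define ENABLE_NEW_XML 1

// Hypothetical function: reports which backend was chosen at compile time.
std::string xmlParserBackend()
{
#if ENABLE(NEW_XML)
    // The new flag switches off the other two parsers.
    return "native";
#elif USE(QXMLSTREAM)
    return "QXmlStream";
#elif USE(LIBXML2)
    // Explicit check replacing what used to be a bare #else.
    return "libxml2";
#else
#error "No XML parser backend selected"
#endif
}
```

Making libxml2 an explicit check means a build with none of the three flags set
fails loudly instead of silently falling through to libxml2.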

On Tue, Jun 28, 2011 at 6:36 PM, Jeffrey Pfau <jp...@apple.com> wrote:
> I don't know all of the problems libxml2 has, but one of the ones I've heard 
> is that WebCore uses UTF-16 internally, and libxml2 uses UTF-8, so the data 
> is perpetually converted between the two formats--and this is slow. If there 
> are any other big ones, I haven't been told them, only that it would be good 
> to have a replacement.
>
> Jeffrey Pfau
>
> On Jun 28, 2011, at 6:30 PM, Dirk Pranke wrote:
>
>> Can you expand a bit more on "using libxml2 exposes its own share of 
>> problems"?
>>
>> -- Dirk
>>
>> On Tue, Jun 28, 2011 at 6:12 PM, Jeffrey Pfau <jp...@apple.com> wrote:
>>> Currently, WebCore uses libxml2, or, if available, QtXml to parse incoming 
>>> XML. However, QtXml isn't always available, and using libxml2 exposes its 
>>> own share of problems. As such, I'm undertaking writing an XML parser that 
>>> uses no external libraries.
>>>
>>> The first step to doing this is to add a new flag that switches off the 
>>> other two parsers. As the parsers are already independent and can be 
>>> switched between by checking USE(QXMLSTREAM), I am adding USE(LIBXML2) 
>>> checks, replacing the #else conditionals, and also a new ENABLE check, 
>>> tentatively called NEW_XML (although names such as NATIVE_XML or 
>>> XML_NATIVE, etc, may be more appropriate).
>>>
>>> As there will probably be a new slew of files pertaining to XML parsing, I 
>>> will put these files in WebCore/xml/parser, and move the existing 
>>> XMLDocumentParser* files into this new directory. As far as I know, the 
>>> placement of these files in WebCore/dom/ is legacy, and, assuming the build 
>>> on each platform is changed, it makes sense to move them.
>>>
>>> Once all the files are in a logical place, I plan to make a new file for a 
>>> skeleton of the new XMLDocumentParser, at least to get it to link until a 
>>> real one is in place, even if the XML parser at that point is just a data 
>>> sink.
>>>
>>> From there, I plan to copy and modify a good chunk of the lower level HTML 
>>> tokenization and parsing code, and make changes as necessary to make it 
>>> work on generalized XML, at least until I can generalize the common code in 
>>> such a way that the HTML and XML tokenizers can be subclasses and use 
>>> common code. I'd probably do the refactoring at the end.
>>>
>>> I'm still exploring the existing parsing code, but I'd probably work my way 
>>> up from there. I've read a lot of the XML 1.0 spec in preparation, as well, 
>>> but it doesn't have much on implementation itself. If QtWebKit or parsing 
>>> people have any comments, concerns, or help, I'd be more than willing to 
>>> listen--I'm just starting here, and I'm not completely familiar with the 
>>> codebase.
>>>
>>> Although no code is checked in so far, I've started on this list already 
>>> and have gotten as far as the new flags, a skeleton 
>>> XMLDocumentParserNew.cpp, and making a tokenizer that compiles and links, 
>>> but is completely untested.
>>>
>>> Jeffrey Pfau
>>> _______________________________________________
>>> webkit-dev mailing list
>>> webkit-dev@lists.webkit.org
>>> http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev
>>>