Correct. We convert from UTF-16 to UTF-8 (for libxml2) and then back to UTF-16.
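
To make the cost of that round trip concrete, here is a minimal, self-contained sketch of the two conversions. It is not WebCore's or libxml2's actual code: the function names utf16ToUtf8 and utf8ToUtf16 are hypothetical, and the sketch assumes well-formed input and does no error handling. The point is only that every character is re-encoded once on the way into the parser and once again on the way out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
// Assumes well-formed input; unpaired surrogates are not rejected.
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t low = in[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-8 bytes back into UTF-16 code units.
// Assumes well-formed input; invalid sequences are not detected.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::uint8_t lead = static_cast<std::uint8_t>(in[i]);
        std::uint32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80) { cp = lead; extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<std::uint8_t>(in[i]) & 0x3F);
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Split code points above the BMP into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Calling utf8ToUtf16(utf16ToUtf8(text)) returns the original text for well-formed input, which is roughly the shape of the work done on each side of the libxml2 boundary.
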
There has been at least one libxml-related security fix to WebCore in
recent memory.

We have various hacks in the libxml2 parser due to libxml2 being designed
to be a library used by applications, and not used by a library like
WebKit:
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L373
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L488
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1093
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1182
http://trac.webkit.org/browser/trunk/Source/WebCore/dom/XMLDocumentParserLibxml2.cpp#L1273

I'm in general in favor of this effort (having worked extensively on the
existing XML parsers). But I would caution you that XML is a ridiculously
tiny fraction of the web, and it may not be worth the engineering effort
to make a better parser.

http://www.google.com/search?q=filetype:html = 25,270,000,000
http://www.google.com/search?q=filetype:xml = 71,000,000

(Naively) judging by those numbers, we should be spending 356 times as
much effort on our HTML support as on our XML support. :)

-eric

On Tue, Jun 28, 2011 at 6:36 PM, Jeffrey Pfau <jp...@apple.com> wrote:
> I don't know all of the problems libxml2 has, but one of the ones I've heard
> is that WebCore uses UTF-16 internally, and libxml2 uses UTF-8, so the data
> is perpetually converted between the two formats--and this is slow. If there
> are any other big ones, I haven't been told them, only that it would be good
> to have a replacement.
>
> Jeffrey Pfau
>
> On Jun 28, 2011, at 6:30 PM, Dirk Pranke wrote:
>
>> Can you expand a bit more on "using libxml2 exposes its own share of
>> problems"?
>>
>> -- Dirk
>>
>> On Tue, Jun 28, 2011 at 6:12 PM, Jeffrey Pfau <jp...@apple.com> wrote:
>>> Currently, WebCore uses libxml2, or, if available, QtXml to parse incoming
>>> XML.
>>> However, QtXml isn't always available, and using libxml2 exposes its
>>> own share of problems. As such, I'm undertaking writing an XML parser that
>>> uses no external libraries.
>>>
>>> The first step to doing this is to add a new flag that switches off the
>>> other two parsers. As the parsers are already independent and can be
>>> switched between by checking USE(QXMLSTREAM), I am adding USE(LIBXML2)
>>> checks, replacing the #else conditionals, and also a new ENABLE check,
>>> tentatively called NEW_XML (although names such as NATIVE_XML or
>>> XML_NATIVE, etc, may be more appropriate).
>>>
>>> As there will probably be a new slew of files pertaining to XML parsing, I
>>> will put these files in WebCore/xml/parser, and move the existing
>>> XMLDocumentParser* file into this new directory. As far as I know, the
>>> placement of these files in WebCore/dom/ is legacy, and, assuming the build
>>> on each platform is changed, it makes sense to move them.
>>>
>>> Once all the files are in a logical place, I plan to make a new file for a
>>> skeleton of the new XMLDocumentParser, at least to get it to link until a
>>> real one is in place, even if the XML parser at that point is just a data
>>> sink.
>>>
>>> From there, I plan to copy and modify a good chunk of the lower level HTML
>>> tokenization and parsing code, and make changes as necessary to make it
>>> work on generalized XML, at least until I can generalize the common code in
>>> such a way that the HTML and XML tokenizers can be subclasses and use
>>> common code. I'd probably do the refactoring at the end.
>>>
>>> I'm still exploring the existing parsing code, but I'd probably work my way
>>> up from there. I've read a lot of the XML 1.0 spec in preparation, as well,
>>> but it doesn't have much on implementation itself.
>>> If QtWebKit or parsing
>>> people have any comments, concerns, or help, I'd be more than willing to
>>> listen--I'm just starting here, and I'm not completely familiar with the
>>> codebase.
>>>
>>> Although no code is checked in so far, I've started on this list already
>>> and have gotten as far as the new flags, a skeleton
>>> XMLDocumentParserNew.cpp, and making a tokenizer that compiles and links,
>>> but is completely untested.
>>>
>>> Jeffrey Pfau
>>> _______________________________________________
>>> webkit-dev mailing list
>>> webkit-dev@lists.webkit.org
>>> http://lists.webkit.org/mailman/listinfo.cgi/webkit-dev