On Fri, Jul 10, 2009 at 11:04:38AM +0200, Diego Santa Cruz wrote:
> Hi,
>
> I've sent a couple of patches to the mailing list last week about speeding up
> the parsing of very large attributes when using the reader API, although I
> think they did not go through the mailing list. I would like to get some
> feedback and contribute these to libxml2, so here they are again, sorry if
> you get this in double.
>
> To summarize the problem we are hitting is that xmlTextReaderPushData()
> parses in chunk of 512 bytes but xmlParseChunk() will wait until having a
> complete tag to proceed with parsing. This is normally OK but when the
> attribute is very large (e.g., 600 K, with inline image data in data hrefs),
> ends up having O(n^2) complexity as each call to xmlParseChunk() will rescan
> the complete accumulated buffer for the start and end of tag. For instance,
> on an ARM9 at 300 MHz it takes 29 sec to parse such a node with a 600 K
> attribute.
>
> The first patch below (patch #1) takes the simple approach of doubling the
> chunk size each time a call to xmlParseChunk() does not parse anything and
> resetting back to 512 bytes whenever something is parsed. It takes down the
> parsing time from 29s to 490 ms for our test file. The problem I see with
> this approach is that we may end up parsing a lot of stuff on a call to
> xmlParseChunk() leading to a very large tree. For instance if a 514 K buffer
> is required to hold the complete tag and we end up passing a chunk of 1024 K
> to the parser.
>
> The second thing I have tried (pacth #2 below) is to start looking for a
> potential end of tag ('>') when a call to xmlParseChunk() does not parse
> anything. The loop in xmlTextReaderPushData() does not call xmlParseChunk()
> until a '>' is found. When the '>' is found it calls xmlParseChunk() with the
> chunk size set to the end of the '>'. With this the parsing time is 450 ms,
> which is just a bit better than the first approach but avoids passing
> needlessly large chunks to xmlParseChunk().
>
> I am very new to libxml2 internals, so I would really appreciate some
> feedback. Are these acceptable ways to speed up parsing of this kind of xml
> docs? Are there any adverse side effects of doing this?
I saw your patches, on bugzilla too, but I hadn't any time yet to
check them, I hope to do that in the near future. The pre-scanning
in xmlParseChunk() can be tricky, so this really need some serious
attention. As a first test make sure that "make check" passes okay
on a git checkout with the patch applied, that will help building
trust :-)
Daniel
--
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
[email protected] | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml