On Fri, Aug 29, 2008 at 09:37:41AM +0200, Stefan Behnel wrote:
> Hi,
> 
> we got a report on the lxml list where someone tried to parse and
> serialise a file that contains 8,000,000 non-ASCII character references
> (‡), as in
> 
>     "<text>" + "&#135;" * 8000000 + "</text>"
> 
> Parsing this is pretty fast, so that's not the problem, but serialising
> this document back to a "US-ASCII" encoding, i.e. re-encoding the
> non-ASCII characters as character references, is slow as hell. The user
> stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
> and you can literally wait for each byte that arrives in the target file.
> 
> Is there any reason why this is so, or does anyone have any insights what
> the problem may be here? This definitely sounds like a bug to me.

  Well, that's a horribly crappy XML document.
I assume the output buffer grows linearly, so you end up realloc'ing all
the time and hit quadratic behaviour as a result; the reallocation of
the buffer should probably use a doubling-at-each-step algorithm
instead. Plus the escaping is done each time the ASCII encoder stops on
a non-ASCII character.
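
Roughly the idea, as a minimal sketch (hypothetical buffer code for
illustration only, the names are made up and this is not the actual
xmlBuffer implementation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical growable output buffer.  Growing by a fixed
     * increment makes N appends copy O(N^2) bytes in total; doubling
     * the capacity keeps the total copying linear (amortized O(1)
     * per append). */
    typedef struct {
        char *data;
        size_t used;
        size_t size;
    } buf_t;

    static int buf_append(buf_t *b, const char *s, size_t len) {
        if (b->used + len + 1 > b->size) {
            size_t newsize = b->size ? b->size : 64;
            while (b->used + len + 1 > newsize)
                newsize *= 2;  /* geometric growth, not size += constant */
            char *tmp = realloc(b->data, newsize);
            if (tmp == NULL) return -1;
            b->data = tmp;
            b->size = newsize;
        }
        memcpy(b->data + b->used, s, len);
        b->used += len;
        b->data[b->used] = '\0';
        return 0;
    }

    int main(void) {
        buf_t b = {NULL, 0, 0};
        int i;
        /* 8,000,000 escaped character references, as in the report */
        for (i = 0; i < 8000000; i++)
            if (buf_append(&b, "&#135;", 6) < 0) return 1;
        printf("%zu bytes buffered\n", b.used);
        free(b.data);
        return 0;
    }

Swap the while loop for a fixed newsize += 4096 and the quadratic
copying becomes obvious.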

If I have two minutes, I will try to look at this today before the 2.7.0 release.

  BTW, if people have a bit of time, checking the latest SVN version
for sanity would help.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[EMAIL PROTECTED]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
