On Fri, Aug 29, 2008 at 09:37:41AM +0200, Stefan Behnel wrote:
> Hi,
>
> we got a report on the lxml list where someone tried to parse and
> serialise a file that contains 8,000,000 non-ASCII character references
> (‡), as in
>
>     "<text>" + "‡" * 8000000 + "</text>"
>
> Parsing this is pretty fast, so that's not the problem, but serialising
> this document back to a "US-ASCII" encoding, i.e. re-encoding the
> non-ASCII characters as character references, is slow as hell. The user
> stopped the run after 12 hours at 100% CPU load. I tried this with
> xmllint and you can literally wait for each byte that arrives in the
> target file.
>
> Is there any reason why this is so, or does anyone have any insight into
> what the problem may be here? This definitely sounds like a bug to me.
Well, that's a horribly crappy XML document. I assume the output buffer
grows linearly, so you end up realloc'ing all the time and hit quadratic
behaviour as a result; the reallocation should probably double the buffer
size at each step instead. Plus the escaping is done where the ASCII
encoder stops. If I have 2 minutes I will try to look at this today,
before the 2.7.0 release.

BTW, if people have a bit of time, checking the latest SVN version for
sanity would help.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[EMAIL PROTECTED]    | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

_______________________________________________
xml mailing list, project page http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
