"Andy Heninger" <[EMAIL PROTECTED]> writes:

> Within the DOM, utf-8 vs utf-16 strings raises another interesting
> question.  For data that consists mostly of Latin characters, utf-8
> strings are about half the size.  And for the DOM, with big documents
> being 100% memory resident, that would be a significant saving.  The W3C
> DOM API does call for utf-16, but that could probably be hand-waved away -
> all of the data is there, the character access and substring functions in
> the DOM API could work correctly (in terms of utf-16 units, as specified
> by W3C), and all of the necessary information is present.  If we were
> starting everything from scratch, both the parser and the DOM, I would
> lobby for utf-8 everywhere.  But I don't think that the DOM should deviate
> from what the scanner is doing, which is to say, I think that we should
> stick with utf-16 strings.
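
For what it's worth, the scheme Andy describes seems workable: store
utf-8 internally but report lengths and offsets in utf-16 code units,
as the W3C API requires. A rough sketch (the function name is mine,
and it assumes well-formed utf-8):

    #include <cstddef>

    // Length of a utf-8 buffer measured in utf-16 code units.
    // 1- to 3-byte sequences encode BMP characters (one unit);
    // 4-byte sequences encode supplementary characters, which
    // become surrogate pairs (two units) in utf-16.
    std::size_t utf16Length(const unsigned char* s, std::size_t bytes)
    {
        std::size_t units = 0;
        for (std::size_t i = 0; i < bytes; ) {
            unsigned char b = s[i];
            if      (b < 0x80) { i += 1; units += 1; }  // ASCII
            else if (b < 0xE0) { i += 2; units += 1; }  // 2-byte seq
            else if (b < 0xF0) { i += 3; units += 1; }  // 3-byte seq
            else               { i += 4; units += 2; }  // 4-byte seq
        }
        return units;
    }

Random character indexing would of course cost a scan like this; that
is part of the price of utf-8 storage.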

This might make implementation a nightmare, but why not support a
compile-time option that builds the library with either utf-8 or
utf-16 support internally? Many of us only plan to parse
Latin-character documents and want the DOM to be small and fast.
Those who want internal support for wide characters could get it,
at a cost in performance.
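
Concretely, I'm imagining something like the following (the macro
and type names here are invented for illustration; nothing like this
exists in the code base today):

    // Hypothetical compile-time switch for the internal encoding.
    #if defined(PARSER_INTERNAL_UTF8)
        typedef unsigned char  XMLUnit;   // utf-8 code units
    #else
        typedef unsigned short XMLUnit;   // utf-16 code units
    #endif

    // Internal string storage written against the chosen unit type:
    struct InternalString
    {
        XMLUnit*     fData;      // encoded character data
        unsigned int fUnitLen;   // length in code units
    };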

That said, in the documents that I deal with (primarily scientific
data), memory usage is dominated by the per-node overhead that Andy
pointed out in his profiling. There isn't much text in the data
(mostly numbers), just 10^6 to 10^7 nodes. So keeping the internal
nodes lean and mean would let me avoid switching to SAX to get my
job done...
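
To put a number on it: on a 32-bit machine, every 4-byte field in
the node structure costs 4 bytes * 10^7 nodes = 40 MB at the top of
my range. An illustrative lean layout (not the actual Xerces node):

    // Illustrative only -- not the actual Xerces node layout.
    struct LeanNode
    {
        LeanNode*   fParent;       // 4 bytes on a 32-bit build
        LeanNode*   fFirstChild;   // 4
        LeanNode*   fNextSibling;  // 4
        const char* fName;         // 4, pointer into a shared name pool
        const char* fValue;        // 4, text data, null if none
    };
    // sizeof(LeanNode) == 20 bytes, so 10^7 nodes is ~200 MB before
    // any character data; each additional field adds another 40 MB.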

jas.

PS. I would love to assist in moving the new implementation forward. 
