From: "Doug Ewell" <[EMAIL PROTECTED]>
I appreciate Philippe's support of SCSU, but I don't think *even I*
would recommend it as an internal storage format.  The effort to encode
and decode it, while by no means Herculean as often perceived, is not
trivial once you step outside Latin-1.

I said: "for immutable strings", which means that these Strings are instanciated for long term, and multiple reuses. In that sense, what is really significant is its decoding, not the effort to encode it (which is minimal for ISO-8859-1 encoded source texts, or Unicode UTF-encoded texts that only use characters from the first page).


Decoding SCSU is very straightforward, even though it is stateful (at the internal character level). For immutable strings there is no need to handle various initial states, and the state associated with each component character of the string has no importance: since strings are immutable, only the decoding of the string as a whole makes sense.
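To give an idea of how little state is involved, here is a minimal decoder sketch in Java (MiniScsuDecoder is my name for it, not a standard class). It covers only single-byte mode with the eight predefined dynamic windows of UTS #6 plus the SQU escape, and rejects the features it omits (window redefinition, Unicode mode, static-window quoting):

    // Minimal sketch of an SCSU decoder: single-byte mode only, with the
    // eight predefined dynamic windows of UTS #6. Window redefinition
    // (SDn/SDX), Unicode mode (SCU) and static-window quotes (SQn) are
    // deliberately not handled.
    final class MiniScsuDecoder {
        // Initial offsets of the eight dynamic windows defined by UTS #6.
        private static final int[] DYNAMIC_OFFSETS = {
            0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
        };

        static String decode(byte[] in) {
            StringBuilder out = new StringBuilder(in.length); // preallocate once
            int window = 0;                // active window; fixed initial state
            for (int i = 0; i < in.length; i++) {
                int b = in[i] & 0xFF;
                if ((b >= 0x20 && b <= 0x7F) || b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0D) {
                    out.append((char) b);  // ASCII and common controls pass through
                } else if (b >= 0x80) {
                    // High bytes map into the active window; with window 0 at
                    // offset 0x0080 this is exactly Latin-1 passthrough.
                    out.append((char) (DYNAMIC_OFFSETS[window] + (b - 0x80)));
                } else if (b >= 0x10 && b <= 0x17) {
                    window = b - 0x10;     // SCn: select another active window
                } else if (b == 0x0E) {    // SQU: next two bytes are one raw
                    out.append((char) (((in[++i] & 0xFF) << 8) | (in[++i] & 0xFF))); // UTF-16 unit
                } else {
                    throw new IllegalArgumentException("SCSU feature not handled: 0x" + Integer.toHexString(b));
                }
            }
            return out.toString();
        }
    }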

The stateful decoding of SCSU can be hidden behind an accessor in a storage class, which can also be optimized easily to avoid multiple reallocations of the decoded buffer.
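A sketch of such a storage class, reusing the decoder above (again, ScsuString is an illustrative name): the compressed bytes are the long-term representation, and decoding happens at most once, on first access.

    // Immutable string stored as SCSU; decoded lazily, once, on first access.
    final class ScsuString {
        private final byte[] scsu;        // compact long-term representation
        private volatile String decoded;  // cache filled by the first accessor call

        ScsuString(byte[] scsu) { this.scsu = scsu.clone(); }

        String get() {
            String s = decoded;
            if (s == null) {
                // Decoding always starts from the fixed SCSU initial state, so
                // no per-character state survives once the String is built.
                s = MiniScsuDecoder.decode(scsu);
                decoded = s;              // benign race: SCSU decoding is deterministic
            }
            return s;
        }
    }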

SCSU can only be a complication if you want mutable strings; however, mutable strings are needed only if you intend to transform a source text and work on its content. If this is a temporary need while creating other immutable strings, you can still use SCSU for encoding the final results and work with UTFs for the intermediate ones, as in the sketch below.
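For example (a sketch; encodeScsu() stands for any conformant SCSU encoder and is left as a hypothetical stub, not a standard API):

    // Do the mutable work in UTF-16 (Java's native String form), then
    // freeze the final result as SCSU.
    static ScsuString buildText(java.util.List<String> parts) {
        StringBuilder work = new StringBuilder();   // mutable UTF-16 intermediate
        for (String p : parts) work.append(p);      // edit and transform freely here
        return new ScsuString(encodeScsu(work.toString())); // compress once, at the end
    }

    // Hypothetical encoder stub: plug in any conformant SCSU encoder.
    static byte[] encodeScsu(String s) { throw new UnsupportedOperationException(); }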

In a text editor, where you constantly need to work at the character level, the text is not immutable, and SCSU is effectively not a good encoding for working on it (whereas all UTFs, including UTF-8 or GB18030, are easy to work with at this level).
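For instance, walking a UTF-16 buffer character by character is stateless, with a constant cost per step, which is exactly what an editor needs (using Java's standard Character methods):

    // Character-level iteration over a UTF-16 buffer: each step is O(1)
    // and needs no decoder state, unlike SCSU.
    static void forEachCodePoint(CharSequence text) {
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);  // handles surrogate pairs
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);             // advance by 1 or 2 code units
        }
    }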

In practice, a text editor often needs to split the edited text into manageable fragments encoded separately, for performance reasons (insertion and deletion in one large buffer are lengthy and costly operations). Given that UTFs can increase the memory needed, it is not unreasonable to consider a compression scheme for the individual fragments of a large text file. The cost of encoding/decoding SCSU can be an interesting optimization if it limits the number of VM swaps to disk needed to access more fragments: the total size on disk will be smaller, reducing the number of I/O operations and so enhancing the program's responsiveness to user commands.
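A sketch of such per-fragment compression, using java.util.zip's Deflater as a stand-in for the codec (an SCSU encoder/decoder would slot into the same role); fragments are assumed small enough for single-call deflate/inflate:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.zip.*;

    // One fragment of the edited text, kept compressed in memory and
    // inflated only while it is displayed or modified.
    final class CompressedFragment {
        private final byte[] packed;
        private final int charCount;      // UTF-16 length, to size the decode buffer

        CompressedFragment(String text) {
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            Deflater d = new Deflater(Deflater.BEST_SPEED);
            d.setInput(utf8);
            d.finish();
            byte[] buf = new byte[utf8.length + 64]; // deflate barely expands small inputs
            int n = d.deflate(buf);
            d.end();
            packed = Arrays.copyOf(buf, n);
            charCount = text.length();
        }

        String unpack() throws DataFormatException {
            Inflater inf = new Inflater();
            inf.setInput(packed);
            byte[] buf = new byte[charCount * 4 + 16]; // >= max UTF-8 size of the text
            int n = inf.inflate(buf);
            inf.end();
            return new String(buf, 0, n, StandardCharsets.UTF_8);
        }
    }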

(Note that such compression schemes already exist, even within filesystems that support editable but still compressed files. SCSU is not the option used in that case, because it is too specific to Unicode texts; they use a much more complex compression scheme, most often derived from the Lempel-Ziv-Welch family of algorithms. This does not significantly increase the total load time, given that it also significantly reduces the frequency of disk I/O, which is a much longer and more costly operation.)

The bad thing about SCSU is that the compression scheme is not deterministic: you can't easily compare two instances of strings encoded with SCSU (because several alternative encodings of the same text are possible) without actually decoding them prior to performing their collation. With standard UTFs, including the Chinese GB18030 standard, the encoding is deterministic and allows comparing encoded strings without first decoding them.
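For example, with UTF-8 the encoding is one-to-one, so equality and even binary ordering can be tested directly on the encoded bytes (and for UTF-8, unsigned byte order happens to match code point order):

    import java.util.Arrays;

    // Safe only because UTF-8 is deterministic: equal strings always
    // produce identical byte sequences (this does not hold for SCSU).
    static boolean sameText(byte[] utf8a, byte[] utf8b) {
        return Arrays.equals(utf8a, utf8b);
    }

    // Unsigned byte order of UTF-8 equals code point order (Java 9+).
    static int binaryCompare(byte[] utf8a, byte[] utf8b) {
        return Arrays.compareUnsigned(utf8a, utf8b);
    }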

But this argument is also true for almost all compression schemes, even for the well-known "deflate" algorithm, for very basic compressors like RLE, or for the newer "bzip2" compression (depending on the compressor implementation used, its tunable parameters, and the number of alternatives and the size of the internal dictionaries considered during compression).

The advantage of SCSU over generic data compressors like "deflate" is that it does not require a large and complex state (all the SCSU decoding state fits in a small number of fixed-size variables), so its decompression can easily be hardcoded and heavily optimized, up to the point where the cost of decompression is nearly invisible to almost all applications. The most significant costs will most often lie in collators or text parsers: a compliant UCA collation algorithm is much more complex to implement and optimize than an SCSU decompressor, and it is more CPU- and resource-intensive.
