2013/1/29 Jim Breen <[email protected]>:
> William_J_G Overington <[email protected]> wrote:
>
>> The idea is that there would be an additional UTF format, perhaps UTF-64,
>> so that each character would be expressed in UTF-64 notation using 64 bits,
>> thus providing error checking and correction facilities at a character level.
>
> Error detection and correction at the character level is considered
> very old-fashioned now. Modern techniques such as Reed-Solomon
> codes[1] are much more effective and involve much less overhead
> than the 100% in the proposal above. Such techniques are already
> used in modern disc storage[2], and when combined with RAID
> techniques[3] provide better data protection than character-level
> redundancy ever would.
>
> In any case, I think issues of error detection and correction are
> quite outside the scope of Unicode.
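(As a concrete illustration of the overhead comparison in the quoted message, here is a minimal sketch. It assumes the third-party `reedsolo` Python package, whose recent versions return from decode() a tuple whose first element is the repaired payload; nothing in it is part of Unicode itself. Thirty-two parity bytes protect an entire block of up to 223 data bytes, roughly 14% overhead on a full block, and still repair up to 16 corrupted bytes, versus the 100% per-character overhead of the UTF-64 idea.)

    from reedsolo import RSCodec

    rsc = RSCodec(32)                           # 32 ECC symbols per (up to 255-byte) block
    payload = "any UTF-8 text at all".encode("utf-8")
    protected = rsc.encode(payload)             # payload followed by 32 parity bytes

    corrupted = bytearray(protected)
    corrupted[3] ^= 0xFF                        # simulate a transmission error
    repaired = rsc.decode(bytes(corrupted))[0]  # first element: the repaired payload
    assert bytes(repaired) == payload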
Fully agree! Character encodings should not depend at all on error correction mechanisms; those are out of scope and will be implemented at whatever upper or lower layer fits the underlying transport/storage/protocol/application infrastructure. There will be NO one-size-fits-all solution, because the sources of errors and how they are distributed depend entirely on these external mechanisms.

And anyway, character-level error correction is really inefficient and not even the best way to prevent or correct errors; a much better system will handle them at the document level (for file systems or web requests), at the block level (e.g. RAID, or P2P distribution such as torrents), or at the datagram level (over networks). How errors are recovered also depends on whether a return path and protocol exist: if such a protocol is available, it will be MORE effective at correcting them. Forward correction of errors is for the cases where there is no return path and no alternate path (like mirrors). These mechanisms also depend on the strength of security (against malicious alterations), and security is itself a domain where the data-integrity algorithms in use are in constant evolution.

This means that the long-term conservation of documents MUST NOT depend on any of these mechanisms: Unicode MUST remain a standalone "black box" working at no level other than plain text, and offering absolutely NO security mechanisms and NO error recovery mechanisms. For this reason it has to define a standard interface (the code points, a few standardized UTF encodings, plus the character encoding model), and nothing else. Any attempt to mix in other standards would in fact make the standard MUCH LESS reliable for long-term conservation of documents and would create new interoperability problems.

By staying neutral about all other technologies, the Unicode standard remains adaptable to all situations, and we get maximum interoperability with ALL security and error correction mechanisms, each specifically tuned to perform BEST in its OWN context of use (storage, transport, integrity, security). This will NEVER prevent those mechanisms from implementing their own local reencoding, as long as the sequences of code points are preserved. Some of them will require the plain text to be normalized, and it is generally acceptable, according to the standard, that such *conformant* processes preserve *at least* the canonical equivalences. Some of them will perform compatibility mappings, but for Unicode those processes are *lossy*: they should not be used to preserve data integrity or security, and cannot be used as safe error correction mechanisms. This includes corrections by orthographic spell checkers, which make guesses and remap distinct characters to other ones, e.g. changing dashes into ASCII hyphens.

Some transport protocols will not preserve Unicode text at all. This is the case for SMS over basic GSM networks: they guarantee text integrity only for a very reduced subset of the standard. For example, they do not preserve Greek or Cyrillic text, because they encode the Latin letter A, the Cyrillic letter A and the Greek letter Alpha the same way, assuming that the effective interpretation will be made according to a language indicated somewhere else in the transport, or assumed from the transport network or client device settings.
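A minimal sketch of the kind of lossy folding just described. The mapping below is only a tiny hand-written excerpt for illustration, not the full GSM 03.38 table: visually identical Greek capitals are best-fit onto the Latin septets, so the original code points cannot be recovered when decoding.

    # Hypothetical excerpt of a GSM-7 best-fit mapping (illustration only).
    TO_GSM7 = {
        "A": 0x41, "B": 0x42,            # Latin capitals
        "\u0391": 0x41, "\u0392": 0x42,  # GREEK CAPITAL ALPHA / BETA folded onto them
        "\u0394": 0x10,                  # GREEK CAPITAL DELTA has its own septet
    }
    FROM_GSM7 = {0x41: "A", 0x42: "B", 0x10: "\u0394"}  # decoding picks ONE interpretation

    def round_trip(text):
        return "".join(FROM_GSM7[TO_GSM7[c]] for c in text)

    print(round_trip("\u0391\u0392\u0394"))                          # Greek "ΑΒΔ" comes back as "ABΔ"
    print(round_trip("\u0391\u0392\u0394") == "\u0391\u0392\u0394")  # False: code points are lost

Decoding has to pick one interpretation, so Greek "ΑΒ" silently becomes Latin "AB"; no later stage can undo that guess, whatever language hint travels out of band.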
Once again, as long as these processes or protocols stay within the context of use in which they operate, they are safe and can be secured, but they cannot deliver interoperability across other networks or storage systems (notably if the language indicator is lost, knowing that Unicode itself does not encode languages directly, as they are out of scope of the standard). So let's focus on the Standard itself rather than extending its scope of use: it's not up to TUS to regulate what other protocols do.

Unicode just describes some conformance levels:
- level 1: preservation of code points
- level 2: preservation of canonical equivalences
- level 3: preservation of compatibility equivalences
- level 4: all other conformant processes (including for example "best-fit" reencoders, transliterators, or text-to-speech renderers)

But this never means that other protocols HAVE to support one of these standardized profiles. If we need long-term preservation of documents, though, only protocols and encodings preserving level 1 should be used.

So we'll still need external protocols that perform well at this level. The Unicode UTFs operate at level 1 (see the sketch below), so they are safe to use when building conforming interfaces between various external processes or protocols, for data transmission or storage, with maximum interoperability between all these level-1 conformant processes, independently of what they effectively transmit or store. But if an external process fails to preserve level 1, it MUST NOT claim conformance to that level, and must not tolerate ANY deviation from it coming from another, higher-level process.
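A minimal sketch, using only the Python standard library, of how these levels differ in practice: a UTF round trip gives back exactly the same code points (level 1), NFC preserves only canonical equivalence (level 2), and NFKC applies compatibility mappings that are lossy (level 3).

    import unicodedata

    s = "e\u0301 \ufb01 \u212b"   # e + combining acute, the "fi" ligature, ANGSTROM SIGN

    # Level 1: a UTF round trip gives back exactly the same code points.
    assert s.encode("utf-8").decode("utf-8") == s
    assert s.encode("utf-16").decode("utf-16") == s

    # Level 2: NFC changes the code point sequence but keeps canonical equivalence
    # (e + U+0301 becomes U+00E9, U+212B becomes U+00C5).
    nfc = unicodedata.normalize("NFC", s)
    print(nfc == s)                                   # False: not level-1 preserving
    print(unicodedata.normalize("NFD", nfc) ==
          unicodedata.normalize("NFD", s))            # True: still canonically equivalent

    # Level 3: NFKC also applies compatibility mappings, which are lossy
    # (the U+FB01 ligature is flattened to the two letters "fi" for good).
    print(unicodedata.normalize("NFKC", s))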

