On 10/21/2012 4:09 AM, Philippe Verdy wrote:
>> Unless there's a way to rebuild the metadata unambiguously or to enforce
>> that it is complete and correct, it's very hard to rely on it for any
>> particular purpose.
> Enforcing that the metadata is correct is perfectly possible, at least
> to ensure that it matches the requirements. (For example, an incorrect
> encoding, given in metadata, should be signaled each time it violates
> one of its rules: this is possible for many standardized text
> encodings, including the UTFs.)

It may be possible to do some verification of well-formedness for well-designed encoding schemes like the UTFs, but, pray, how do you tell 8859-1 apart from 8859-15?
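
For what it's worth, such a check is easy to write for the UTFs. A
minimal sketch in Python (the function name is my own, purely for
illustration):

    # Check whether a byte string is actually well-formed under the
    # encoding its metadata declares. This catches mismatches for
    # self-checking schemes like UTF-8, where many byte sequences
    # are ill-formed.
    def matches_declared_encoding(data: bytes, declared: str) -> bool:
        try:
            data.decode(declared)
            return True
        except UnicodeDecodeError:
            return False

    matches_declared_encoding(b'\xc3\xa9', 'utf-8')  # True: well-formed 'é'
    matches_declared_encoding(b'\xe9', 'utf-8')      # False: truncated sequence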

These are not rarely occurring character sets, and enforcement for them, as for any other member of the 8859 series, would only be possible if you were to do the very same character-set sniffing that you so dislike.
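
To make that concrete: both 8859-1 and 8859-15 assign a character to
every byte value, so a well-formedness check of the kind sketched
above can never flag either of them (Python again):

    # Decoding arbitrary bytes under either charset always succeeds;
    # the two interpretations differ only at eight byte values, and
    # nothing in the data itself says which one was meant.
    import os
    data = os.urandom(1024)
    data.decode('iso8859-1')    # never raises
    data.decode('iso8859-15')   # never raises either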

If you run a variation of a language detector, it's possible to detect, for example, that the text is in Icelandic and therefore requires 8859-1 instead of 8859-15. That is because the few code points that are mapped to different characters in these two sets would (statistically) appear in the wrong context.
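
Here is a rough sketch (Python; the scoring and the names are my own
invention, not any real detector) of what such context-based
disambiguation might look like:

    # The eight byte values where ISO 8859-1 and ISO 8859-15 disagree:
    # 8859-1 maps them to symbols (currency sign, fractions, spacing
    # diacritics); 8859-15 maps them to the euro sign and the letters
    # S-caron, s-caron, Z-caron, z-caron, OE, oe, Y-diaeresis.
    AMBIGUOUS = {0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE}

    def guess_8859_1_vs_15(data: bytes) -> str:
        """Crude heuristic: a letter next to an ambiguous byte favours
        the 8859-15 letter reading; anything else favours 8859-1."""
        score_15 = 0
        for i, b in enumerate(data):
            if b not in AMBIGUOUS:
                continue
            prev = chr(data[i - 1]) if i > 0 else ' '
            nxt = chr(data[i + 1]) if i + 1 < len(data) else ' '
            score_15 += 1 if (prev.isalpha() or nxt.isalpha()) else -1
        return 'iso8859-15' if score_15 > 0 else 'iso8859-1'

    sample = b'c\xbdur'                 # 0xBD is 'œ' in 8859-15, '½' in 8859-1
    guess = guess_8859_1_vs_15(sample)  # 'iso8859-15'
    print(sample.decode(guess))        # prints 'cœur'

A real detector would use full language statistics, of course; this
only shows the principle.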

This is something a clever text editing (or HTML editing) tool could do, but not something that you can build into an OS.

Anyway, to cut the discussion short, I'd love to see a working example of any system where metadata are 100% reliable.

A./
