Kenneth Whistler wrote the following.

>I think Marku's suggestion is correct. If you want to do
>something like this internally to a process, use a noncharacter
>code point for it. If you want to have visible display of this
>kind of error handling for conversion, then simply declare a
>convention for the use of an already existing character.

>My suggestion would be: U+2620. ;-) Then get people to share
>your convention.

I find this suggestion curious, particularly coming as it does from an officer of the Unicode Consortium. The U2600.pdf code chart lists U+2620 SKULL AND CROSSBONES under "Warning signs" and gives "= poison" in its description.

Suppose, for example, that the source document encoded in UTF-8 is about chemicals found around the house and uses U+2620 to mark those which are poisonous. If U+2620 is also used to show, in visible form, an error found during decoding, then a U+2620 character in the decoded document is ambiguous: the reader cannot tell whether it came from the source document or from the decoder.

One solution would be for the Unicode Consortium to encode an otherwise unused character especially for the purpose.

If, however, the way forward is for an individual to declare a convention, then I suggest a sequence of at least two characters, the first being a base character and the rest being combining characters, chosen so as to produce an otherwise highly unlikely sequence. For example, U+0304 COMBINING MACRON could be a good choice of combining character, as a macron can indicate a Boolean "not" condition and it would be attached to a character which is otherwise unlikely to carry an accent. As to which character to use as the base character, I am undecided; however, it should not, in my opinion, be U+2620, as that is a warning sign meaning poison and could cause confusion for someone looking at a document.

The advantage of a two character sequence is that a special piece of software may be used to parse all incoming documents. Only occurrences of the otherwise highly unlikely sequence are regarded as indicating a conversion problem. If either of the two characters is encountered other than with the rest of the sequence, it does not trigger the special effect. In my comet circumflex system I use a three character detection sequence.
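
The scanning idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an agreed convention: the base character is left undecided in this post, so U+25CA LOZENGE is a placeholder chosen purely for the example, and the function name is hypothetical. Only U+0304 COMBINING MACRON comes from the suggestion itself.

```python
# Placeholder marker for the sketch. The base character (U+25CA LOZENGE) is an
# assumption made for illustration; U+0304 COMBINING MACRON is the combining
# character suggested in the post.
MARKER = "\u25CA\u0304"


def find_conversion_errors(text, marker=MARKER):
    """Return the index of every complete occurrence of marker in text.

    A base character or combining mark that appears on its own, without the
    rest of the sequence, is ordinary text and is not reported.
    """
    positions = []
    start = text.find(marker)
    while start != -1:
        positions.append(start)
        # Resume the search just past the match that was found.
        start = text.find(marker, start + len(marker))
    return positions


# A document that uses U+2620 to mean "poison" contains no marker at all.
household = "Bleach \u2620 and drain cleaner \u2620"
print(find_conversion_errors(household))

# Two real markers, plus a stray lozenge and a stray macron that are ignored.
decoded = "ab" + MARKER + "c\u25CA d\u0304" + MARKER
print(find_conversion_errors(decoded))
```

Because the check is for the whole sequence, the same scanner can be run over text prepared with no knowledge of the convention. A document that happens to contain the base character alone, or a macron on some other letter, passes through without being flagged.
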
In the comet circumflex system, all three characters of the detection sequence must be present, in order, to enter the markup universe. Thus a piece of software can scan all incoming text messages, even those which are not designed to fit in with the comet circumflex system, and will not report a comet circumflex message if, say, a lone U+2604 COMET character arrives as part of a message.

Using a two or three character sequence which is otherwise highly unlikely to occur is, in my opinion, a good way to indicate the presence of a special feature. It allows one to monitor all text files for the special feature without causing undesired responses on text files which have been prepared without any regard to it.

I feel that the influence of posting a suggestion in this mailing list is often greatly underestimated. If you post a suggested two or three character sequence for the purpose that you seek, perhaps after further discussion in this group if you wish, my feeling is that the sequence may well become known and accepted for the purpose very quickly. Where there is a need for such a sequence and no good reason not to use it, people will often happily adopt the suggested format.

William Overington

1 November 2002

