At 11:41 AM 2/3/02 +0900, Gaspar Sinai wrote:

> Unicode and Security
>
> I would like to start a series of discussions about
> the security aspects of Unicode.
At the outset, before we can have a discussion, we need to define what
'security' means. Some see security issues where virus-like data can be
injected into a system via an API taking Unicode strings. (There was a
longer discussion on the list on *that* issue.) Then we had the
discussion on bypassing syntax checking for path names by using
non-shortest encodings in UTF-8.

Your topics revolve around different aspects of uniqueness. You desire:

a) a unique storage order for bidi texts
b) unique storage for the same character shapes
c) unique storage for the same letter sequence

In this context you ask:

> Is Unicode secure? What character standards can be
> considered secure?

Any logical-to-visual ordering of data is non-unique in the general
case. (Simple cases may be unique, but at least some complex cases
aren't.) Any character set supporting bidirectional ordering is subject
to this issue. Since transmitting pre-ordered data prevents such things
as re-flowing browser windows, it is not generally acceptable for
Arabic, Farsi, Hebrew, Syriac or Urdu data. Therefore, insisting on
case (a) would disallow these languages.

> I had the following problems where Unicode could not
> be used because of security issues. In all cases
> the signer of a document can be lured into
> believing that the wording of the document he/she
> is about to sign is different.
>
> How can it be? I had the following problems:
>
> 1. Character Order Problem
>
>    The BIDI algorithm is too complex and not reversible.
>    I could create a BIDI document where only RLO, LRO and
>    PDF characters were used, and WORD, JAVA and KDE
>    produced different word orderings. I don't have access
>    to an MS platform now to reproduce this, but as far as
>    I can tell it was like:
>
>    <RLO>text1<PDF>U+0020<RLO>text2<PDF>
>
>    Because the BIDI algorithm is too complex and vague,
>    it can be said that these programs all displayed
>    the text correctly, still differently.
>    text1 text2
>    text2 text1

The bidi algorithm is anything but vague. Any implementation can be
rigorously tested against two reference implementations to ensure full
compatibility. The problem is that some environments deliberately
deviate from it, for both good and bad reasons.

The 'bad' reason is that the algorithm (without overrides) occasionally
has to pick a default treatment of a symbol (e.g., is '/' going to
work correctly for dates or for fractions?). Some environments change
the algorithm because either fractions or dates are so prevalent that
they feel the correct solution (adding overrides) is not realizable.

The 'good' reason applies in cases such as WORD, where we are *not*
talking about *plain* text. In rich text, all runs can have fully
resolved directionality at all times, making the bidi algorithm
necessary only on plain text import and export. Some of the features
of fully resolved text (where the directions are kept in style
information) are hard to duplicate in plain text, except by liberally
using overrides (which not all text recipients handle well).

These are the two cases; perhaps 'good' and 'bad' should be assigned
the other way around...

> 2. Character Shape Problem
>
>    I had different character shapes, because:
>
>    a) Ligatures
>
>    In complex scripts, in Devanagari for instance, the
>    ZERO WIDTH JOINER should be used to prevent ligature
>    forming and normally join the characters.
>
>    Whether ligature forming will actually happen or not
>    is completely up to the font. If the font does have
>    the ligature, it will be formed. The standard does
>    not define all the compulsory ligatures.
>
>    I was even thinking about putting a ZERO WIDTH JOINER
>    after each character. But why do we have ZERO WIDTH
>    JOINER at all? I think a ZERO WIDTH LIGATURE FORMER
>    would be better. In that case at least I would know
>    that a ligature may appear at that point.

The problem that ligatures are font dependent remains.
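As an aside on the override example quoted under problem 1: whatever a
renderer does with them, explicit directional overrides are at least
mechanically detectable before a document is signed. A minimal Python
sketch using only the standard library (the character list and function
name are mine, for illustration, and cover only the explicit
embedding/override controls, not the full bidi machinery):

```python
import unicodedata

# Explicit bidi embedding/override controls whose presence can silently
# reorder the displayed text. Illustrative subset, not an exhaustive list.
BIDI_CONTROLS = {
    "\u202A",  # LEFT-TO-RIGHT EMBEDDING (LRE)
    "\u202B",  # RIGHT-TO-LEFT EMBEDDING (RLE)
    "\u202C",  # POP DIRECTIONAL FORMATTING (PDF)
    "\u202D",  # LEFT-TO-RIGHT OVERRIDE (LRO)
    "\u202E",  # RIGHT-TO-LEFT OVERRIDE (RLO)
}

def find_bidi_controls(text):
    """Return (index, code point, name) for every bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

# The quoted example: <RLO>text1<PDF> <RLO>text2<PDF>
sample = "\u202Etext1\u202C \u202Etext2\u202C"
for pos, cp, name in find_bidi_controls(sample):
    print(pos, cp, name)
```

Flagging (or refusing) such controls at the point where a document is
presented for signature sidesteps the ordering ambiguity without
forbidding bidirectional data outright.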
We cannot do Devanagari without some way of asking for ligatures, and
as long as we are not standardizing the *fonts*, this problem remains:
the final display will depend on the font. This is an issue for all
scripts for which specifying ligatures (or preventing them) is either
strictly required or at least a common practice.

In a Latin example, using the 'fi' compatibility character, or using f
and i with a font that ligates and an application that enables this
feature of the font, will give two different backing stores for the
same appearance. Note that the Mac character set has an 'fi' in it, so
this is not at all unique to Unicode.

>    b) Hidden Marks
>
>    It is possible to make a combining mark, like a
>    negation mark, appear in the base character's body,
>    making it invisible. It is nearly impossible to
>    test the rendering engine for all possible
>    combinations.

This is no different from any other form of spoofing. You could use A,
A and A, where each is from the Latin, Greek and Cyrillic script, for
example. Or you could use a font where 1 and l, or I and l, or even O
and 0 look the same, and then you can get the same result in ASCII.

> 3. Text Search Problem
>
>    It is possible to create texts that look the same,
>    but they cannot be searched, because even when fully
>    decomposed and ordered they will be different.

I think this is not a new category, but a summary of cases a, b, and c
above.

I've tried to show that many of the examples are related to the fact
that the script in question does not follow these simple rules:

R1. Each symbol has a unique appearance
R2. Each symbol has an unchanging visual appearance
R3. Each symbol has a deterministic location in the output

Requiring R1 eliminates practically all multilingual character codes,
including limited ones such as ISO/IEC 8859-7 (Latin/Greek).

Requiring R2 eliminates any and all scripts with certain forms of
character shaping, ligating, or conjunct formation requirements.
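Both failures of R1 and R2 can be shown in a few lines of standard
Python. This sketch (variable names mine) shows three identical-looking
capitals that are distinct code points from three scripts, and the 'fi'
compatibility ligature, which folds back to the letters f + i only
under compatibility (NFKC) normalization:

```python
import unicodedata

# Three capital letters that render identically in many fonts but are
# distinct code points from three different scripts (an R1 failure).
latin, greek, cyrillic = "A", "\u0391", "\u0410"

for ch in (latin, greek, cyrillic):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print(latin == greek, latin == cyrillic)   # unequal despite equal looks

# The 'fi' compatibility ligature vs. the two-letter sequence: two
# backing stores, one appearance (an R2-style ambiguity).
ligature = "\uFB01"   # LATIN SMALL LIGATURE FI
letters = "fi"
print(ligature == letters)                                  # False
print(unicodedata.normalize("NFKC", ligature) == letters)   # True
```

Compatibility normalization resolves the ligature case, but nothing in
the encoding itself collapses the cross-script look-alikes; that takes
a separate confusability check.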
A fully secure system that cannot handle such data is of limited use
in a global economy.

Requiring R3 eliminates bidirectional scripts.

I'm not trying to deny that these are challenges for constructing
secure systems. What I'm trying to get across is that these issues are
not caused by the character encoding, but by what the encoding encodes.

Therefore, the challenge needs to be to find ways to address these
security concerns that do not disallow global or multilingual data.
Finding such ways would be a worthwhile discussion.

A./
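P.S. To make the search problem in case 3 concrete: canonically
equivalent spellings differ at the code point level, and the standard
remedy is to normalize both sides before comparing, rather than to
demand a unique encoding. A small Python sketch (example strings mine):

```python
import unicodedata

# Two spellings of 'café': precomposed U+00E9 vs. 'e' + combining acute.
precomposed = "caf\u00E9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)   # False: different code point sequences
print(decomposed.find("\u00E9"))   # -1: a naive search misses one spelling

# Normalizing both sides to NFC (or NFD) before comparing fixes this case:
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(precomposed) == nfc(decomposed))   # True
```

This handles canonical equivalence only; the spoofing cases above need
more than normalization.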

