At 11:41 AM 2/3/02 +0900, Gaspar Sinai wrote:

> Unicode and Security
>
> I would like to start a series of discussions about
> the security aspects of Unicode.
At the outset, before we can have a discussion, we need to define what
'security' means. Some see security issues where virus-like data can be
injected into a system via an API taking Unicode strings. (There was a
longer discussion on the list on *that* issue.) Then we had the
discussion on bypassing syntax checking for path names by using
non-shortest encodings in UTF-8.

Your topics revolve around different aspects of uniqueness. You desire:

a) a unique storage order for bidi texts
b) unique storage for the same character shapes
c) unique storage for the same letter sequence

In this context you ask:

> Is Unicode secure? What character standards can be
> considered secure?

Any logical-to-visual ordering of data is non-unique in the general
case. (Simple cases may be unique, but at least some complex cases
aren't.) Any character set supporting bidirectional ordering is subject
to this issue. Since transmitting pre-ordered data prevents such things
as re-flowing browser windows, it is not generally acceptable for
Arabic, Farsi, Hebrew, Syriac or Urdu data. Therefore, insisting on
case (a) would disallow these languages.

> I had the following problems where Unicode could not
> be used because of security issues. In all cases
> the signer of a document can be lured into
> believing that the wording of the document he/she
> is about to sign is different.
>
> How can it be? I had the following problems:
>
> 1. Character Order Problem
>
>    The BIDI algorithm is too complex and not reversible.
>    I could create a BIDI document where only RLO, LRO and
>    PDF characters were used, and WORD, JAVA and KDE
>    produced different word orderings. I don't have access
>    to an MS platform now to reproduce this, but as far as
>    I can tell it was like:
>
>    <RLO>text1<PDF>U+0020<RLO>text2<PDF>
>
>    Because the BIDI algorithm is too complex and vague,
>    it can be said that these programs all displayed
>    the text correctly, still differently.
>    text1 text2
>    text2 text1

The bidi algorithm is anything but vague. Any implementation can be
rigorously tested against two reference implementations to ensure full
compatibility. The problem is that some environments deliberately
deviate from it, for both good and bad reasons.

The 'bad' reason is that the algorithm (without overrides) occasionally
has to pick a default treatment of a symbol (e.g., is '/' going to
work correctly for dates or for fractions?). Some environments change
the algorithm because either fractions or dates are so prevalent that
they feel the correct solution (adding overrides) is not realizable.

The 'good' reason applies in cases such as WORD, where we are *not*
talking about *plain* text. In rich text, all runs can have fully
resolved directionality at all times, making the bidi algorithm
necessary only on plain text import and export. Some of the features
of fully resolved text (where the directions are kept in style
information) are hard to duplicate in plain text, except by liberally
using overrides (which not all text recipients handle well).

These are the two cases; perhaps 'good' and 'bad' should be assigned
the other way around...

> 2. Character Shape Problem
>
>    I had different character shapes, because:
>
>    a) Ligatures
>
>    In complex scripts, in Devanagari for instance, the
>    ZERO WIDTH JOINER should be used to prevent ligature
>    forming and normally join the characters.
>
>    Whether ligature forming will actually happen or not
>    is completely up to the font. If the font does have
>    the ligature, it will be formed. The standard does
>    not define all the compulsory ligatures.
>
>    I was even thinking about putting a ZERO WIDTH JOINER
>    after each character. But why do we have ZERO WIDTH
>    JOINER at all? I think a ZERO WIDTH LIGATURE FORMER
>    would be better. In that case at least I would know
>    that a ligature may appear at that point.

The problem that ligatures are font dependent remains.
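As an aside on the override example quoted under problem 1: whatever a
renderer does with them, explicit directional overrides are at least
mechanically detectable before a document is signed. A minimal Python
sketch using only the standard library (the character list and function
name are mine, for illustration, and cover only the explicit
embedding/override controls, not the full bidi machinery):

```python
import unicodedata

# Explicit bidi embedding/override controls whose presence can silently
# reorder the displayed text. Illustrative subset, not an exhaustive list.
BIDI_CONTROLS = {
    "\u202A",  # LEFT-TO-RIGHT EMBEDDING (LRE)
    "\u202B",  # RIGHT-TO-LEFT EMBEDDING (RLE)
    "\u202C",  # POP DIRECTIONAL FORMATTING (PDF)
    "\u202D",  # LEFT-TO-RIGHT OVERRIDE (LRO)
    "\u202E",  # RIGHT-TO-LEFT OVERRIDE (RLO)
}

def find_bidi_controls(text):
    """Return (index, code point, name) for every bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

# The quoted example: <RLO>text1<PDF> <RLO>text2<PDF>
sample = "\u202Etext1\u202C \u202Etext2\u202C"
for pos, cp, name in find_bidi_controls(sample):
    print(pos, cp, name)
```

Flagging (or refusing) such controls at the point where a document is
presented for signature sidesteps the ordering ambiguity without
forbidding bidirectional data outright.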
We cannot do Devanagari without some way of asking for ligatures, and
as long as we are not standardizing the *fonts*, this problem remains:
the final display will depend on the font. This is an issue for all
scripts for which specifying ligatures (or preventing them) is either
strictly required or at least a common practice.

In a Latin example, using the 'fi' compatibility character, or using f
and i with a font that ligates and an application that enables this
feature of the font, will give two different backing stores for the
same appearance. Note that the Mac character set has an 'fi' in it, so
this is not at all unique to Unicode.

>    b) Hidden Marks
>
>    It is possible to make a combining mark, like a
>    negation mark, appear in the base character's body,
>    making it invisible. It is nearly impossible to
>    test the rendering engine for all possible
>    combinations.

This is no different from any other form of spoofing. You could use A,
A and A, where each is from the Latin, Greek and Cyrillic script, for
example. Or you could use a font where 1 and l, or I and l, or even O
and 0 look the same, and then you can get the same result in ASCII.

> 3. Text Search Problem
>
>    It is possible to create texts that look the same,
>    but they cannot be searched, because even when fully
>    decomposed and ordered they will be different.

I think this is not a new category, but a summary of cases a, b, and c
above.

I've tried to show that many of the examples are related to the fact
that the script in question does not follow these simple rules:

R1. Each symbol has a unique appearance
R2. Each symbol has an unchanging visual appearance
R3. Each symbol has a deterministic location in the output

Requiring R1 eliminates practically all multilingual character codes,
including limited ones such as ISO/IEC 8859-7 (Latin/Greek).

Requiring R2 eliminates any and all scripts with certain forms of
character shaping, ligating, or conjunct formation requirements.
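Both failures of R1 and R2 can be shown in a few lines of standard
Python. This sketch (variable names mine) shows three identical-looking
capitals that are distinct code points from three scripts, and the 'fi'
compatibility ligature, which folds back to the letters f + i only
under compatibility (NFKC) normalization:

```python
import unicodedata

# Three capital letters that render identically in many fonts but are
# distinct code points from three different scripts (an R1 failure).
latin, greek, cyrillic = "A", "\u0391", "\u0410"

for ch in (latin, greek, cyrillic):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print(latin == greek, latin == cyrillic)   # unequal despite equal looks

# The 'fi' compatibility ligature vs. the two-letter sequence: two
# backing stores, one appearance (an R2-style ambiguity).
ligature = "\uFB01"   # LATIN SMALL LIGATURE FI
letters = "fi"
print(ligature == letters)                                  # False
print(unicodedata.normalize("NFKC", ligature) == letters)   # True
```

Compatibility normalization resolves the ligature case, but nothing in
the encoding itself collapses the cross-script look-alikes; that takes
a separate confusability check.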
A fully secure system that cannot handle such data is of limited use
in a global economy.

Requiring R3 eliminates bidirectional scripts.

I'm not trying to deny that these are challenges for constructing
secure systems. What I'm trying to get across is that these issues are
not caused by the character encoding, but by what the encoding encodes.

Therefore, the challenge needs to be to find ways to address these
security concerns that do not disallow global or multilingual data.
Finding such ways would be a worthwhile discussion.

A./
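P.S. To make the search problem in case 3 concrete: canonically
equivalent spellings differ at the code point level, and the standard
remedy is to normalize both sides before comparing, rather than to
demand a unique encoding. A small Python sketch (example strings mine):

```python
import unicodedata

# Two spellings of 'café': precomposed U+00E9 vs. 'e' + combining acute.
precomposed = "caf\u00E9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)   # False: different code point sequences
print(decomposed.find("\u00E9"))   # -1: a naive search misses one spelling

# Normalizing both sides to NFC (or NFD) before comparing fixes this case:
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(precomposed) == nfc(decomposed))   # True
```

This handles canonical equivalence only; the spoofing cases above need
more than normalization.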

