Re: Nicest UTF

D. Starner Fri, 10 Dec 2004 17:22:51 -0800

"Marcin 'Qrczak' Kowalczyk" writes:

> "D. Starner" writes: 
>
> > This implies that every programmer needs an in-depth knowledge of 
> > Unicode to handle simple strings. 
> 
> There is no way to avoid that.


Then there's no way that we're ever going to get reliable Unicode
support. 
 
> If the runtime automatically performed NFC on input, then a part of a 
> program which is supposed to pass a string unmodified would sometimes 
> modify it. Similarly with NFD.

No. By the same logic you used above, I can expect the programmer to
understand their tools, and if they need to pass strings unmodified,
they shouldn't load them using methods that normalize the string.
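To make that concrete, here is a rough sketch in Python (my choice of language for illustration, nothing in this thread assumes it). The two loader names are made up; the point is only that a runtime can offer both a normalizing and a non-normalizing way in, and the programmer picks.

```python
import unicodedata


def load_normalized(raw: bytes) -> str:
    """Hypothetical loader: decode UTF-8, then normalize to NFC."""
    return unicodedata.normalize("NFC", raw.decode("utf-8"))


def load_verbatim(raw: bytes) -> str:
    """Hypothetical loader: decode UTF-8 and leave the code points alone."""
    return raw.decode("utf-8")


# "s" followed by U+0302 COMBINING CIRCUMFLEX ACCENT, encoded as UTF-8.
raw = "s\u0302".encode("utf-8")

normalized = load_normalized(raw)
verbatim = load_verbatim(raw)

# The normalizing loader composes the pair into U+015D (one code point);
# the verbatim loader keeps both code points.
print(len(normalized), len(verbatim))

# Only the verbatim string round-trips to the original bytes.
print(verbatim.encode("utf-8") == raw)
print(normalized.encode("utf-8") == raw)
```

A program that must pass strings through unmodified would use something like the second loader; everything else can use the first.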
 
> You can't expect each and every program which compares strings to 
> perform normalization (e.g. Linux kernel with filenames). 

As has been pointed out here, Posix filenames are not character strings; 
they are byte strings. They quite likely aren't even valid UTF-8 strings.

> > So S should _sometimes_ match an accented S? Again, I feel extended misery 
> > of explaining to people why things aren't working right coming on. 
> 
> Well, otherwise things get ambiguous, similarly to these XML issues. 

Sometimes things get ambiguous if one day ŝ is matched by s and one
day ŝ isn't? That's absolutely wrong behavior; the program must serve
the user, not the programmer. 's' cannot, should not, must not match 'ŝ';
and if it must, then it absolutely always must match 'ŝ', and some way
to make a regex that matches s but not ŝ must be designed. It doesn't
matter what problems exist in the world of programming; that is the
entirely reasonable expectation of the end user.
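Here is the inconsistency in a few lines of Python, using its code-point-based `re` module, followed by one possible way to get the same answer for both forms. The function `find_plain` is hypothetical and deliberately simplistic: it treats a character as "accented" only if the next code point has a nonzero canonical combining class, which does not cover every combining mark.

```python
import re
import unicodedata

precomposed = "\u015d"                                   # ŝ as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # "s" + U+0302

pattern = re.compile("s")

# Same text as far as the user is concerned, two different answers.
print(pattern.search(precomposed) is not None)
print(pattern.search(decomposed) is not None)


def find_plain(text: str, ch: str) -> list:
    """Offsets where `ch` occurs and is NOT followed by a combining mark."""
    hits = []
    for i, c in enumerate(text):
        if c != ch:
            continue
        following = text[i + 1:i + 2]
        if following and unicodedata.combining(following):
            continue  # base letter carries an accent; not a plain match
        hits.append(i)
    return hits


# Both representations of ŝ now give the same answer: no plain "s" here.
print(find_plain(precomposed, "s"), find_plain(decomposed, "s"))
print(find_plain("so", "s"))
```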

> Does "\n" followed by a combining code point start a new line? 

The Standard says no, that's a defective combining sequence.

> Does 
> a double quote followed by a combining code point start a string 
> literal? 

That would depend on your language. I'd prefer no, but it's obvious
many have made other choices.

> Does a slash followed by a combining code point separate 
> subdirectory names?

In Unix, yes; that's because filenames in Unix are byte streams with
the byte 0x2F acting as a path separator.
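A quick sketch of that, in Python, with a path I made up for the purpose. The split below works on raw bytes, as I understand the kernel's path lookup to do; it neither knows nor cares that the bytes after 0x2F happen to be a combining character in UTF-8.

```python
# "a", "/", U+0338 COMBINING LONG SOLIDUS OVERLAY, "b", encoded as UTF-8.
path = "a/\u0338b".encode("utf-8")
print(path)

# Splitting on the byte 0x2F: the slash still separates components, and the
# combining mark's bytes (0xCC 0xB8) end up at the start of the second name.
components = path.split(b"/")
print(components)
```

Since 0x2F never occurs inside a multi-byte UTF-8 sequence, the split is at least unambiguous at the byte level, whatever it looks like on screen.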
 
> It's hard enough to convince them that a 
> character is not the same as a byte. 

That contradicts your statement above that every programmer needs an
in-depth knowledge of Unicode.

> In case I want to circumvent security or deliberately cause a piece of 
> software to misbehave. Robustness requires unambiguous and simple rules. 

The rules you are offering are only simple and unambiguous to the programmer;
they appear completely random to the end user. To have ≮ sometimes start a
tag means that a user can't look at the XML and tell whether something opens
a tag or is just text. You might be able to expect all programmers to know
the rules, but you can't expect all end users to.
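For anyone who hasn't looked at the character in question, a short Python check of what normalization does to it. Nothing here is XML-specific; it only shows that the one glyph the user sees has two canonically equivalent spellings, one of which begins with a literal "<".

```python
import unicodedata

not_less_than = "\u226e"   # ≮ NOT LESS-THAN, a single code point

# Canonical decomposition: "<" followed by U+0338.
nfd = unicodedata.normalize("NFD", not_less_than)
print([f"U+{ord(c):04X}" for c in nfd])

# After NFD the text starts with an ordinary less-than sign.
print(nfd.startswith("<"))

# NFC puts it back together, so the two forms are canonically equivalent.
print(unicodedata.normalize("NFC", nfd) == not_less_than)
```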