RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

Arcane Jill Tue, 09 Dec 2003 08:51:46 -0800

Hmm. Now here's some C++ source code (syntax-colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):

int n = wcslen(L"café");

(That's int n = wcslen(L"café"); for those without HTML email)

The L prefix on a string literal makes it a wide-character string, and wcslen() is simply the wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment.)

So, should n equal four or five? The answer would appear to depend on whether the source file was saved in NFC (where "é" is the single character U+00E9) or in NFD (where it is "e" followed by U+0301).
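To make the two cases concrete, here is a minimal sketch that spells each form with explicit escapes, so the result no longer depends on how an editor saved the file. The names kCafeNfc, kCafeNfd, nfc_length, nfd_length and check_lengths are made up for illustration, and it keeps the assumption above that a wide character holds one Unicode character.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// The same visible text, "café", written with universal character names.
// NFC: U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed).
const wchar_t* const kCafeNfc = L"caf\u00e9";
// NFD: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const wchar_t* const kCafeNfd = L"cafe\u0301";

// wcslen counts wchar_t units up to the terminator; it knows nothing
// about canonical equivalence.
std::size_t nfc_length() { return std::wcslen(kCafeNfc); }
std::size_t nfd_length() { return std::wcslen(kCafeNfd); }

// Illustrative self-check; call it from main() to run the comparison.
void check_lengths() {
    assert(nfc_length() == 4);  // c, a, f, é
    assert(nfd_length() == 5);  // c, a, f, e, combining acute
}
```

Writing the escapes out is also one way for a programmer to pin down the intended form regardless of what the editor does on save.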

There is more to consider than just how and whether a text editor normalizes. If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every character. The answer to the question I posed in the previous paragraph should ideally be obvious by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. This implies that such a text editor should display NFD text as separate glyphs for each character.
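Short of a special display mode, one crude way to make the composition form visible is to list the code points. The following is only a sketch: code_points and check_code_points are hypothetical helpers, not part of any editor or library, and they assume one wchar_t per code point (true for this BMP example).

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: renders each wchar_t as U+XXXX, separated by spaces,
// so precomposed and decomposed spellings can be told apart by sight.
std::string code_points(const std::wstring& text) {
    std::string out;
    for (wchar_t ch : text) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(ch));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// Illustrative self-check; call it from main() to run it.
void check_code_points() {
    assert(code_points(L"caf\u00e9") == "U+0063 U+0061 U+0066 U+00E9");
    assert(code_points(L"cafe\u0301") == "U+0063 U+0061 U+0066 U+0065 U+0301");
}
```

An editor could show something like this in a status line or tooltip while still rendering both spellings identically in the main view.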

On the other hand, such a text editor must also acknowledge that "é" and "e" followed by U+0301 are canonically equivalent. The intention of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything. So in other contexts, it should display them the same.

Yuk. That's a lot to think about for anyone considering writing a programmers' text editor with serious Unicode support.
Jill

-----Original Message-----
From:    Philippe Verdy [mailto:[EMAIL PROTECTED]]
Sent:   Tuesday, December 09, 2003 2:04 PM
To:   [EMAIL PROTECTED]
Cc:   [EMAIL PROTECTED]
Subject:   RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

I would not like to use any Unicode plain-text editor that implicitly
normalizes the text without asking me, to work on programming source
files or XML or HTML files. But I will accept it, if the editor really
understands the language or XML syntax (and exhibits it to the user with
syntax coloring).
