[OT?] Uniscribe for Malayalam and Oriya
Hallo everybody. I hope this is not (too) OT. In case it is, can somebody please redirect me to a more appropriate forum? I am trying to display Malayalam and Oriya on an MS Win2000 system. The fonts I am using are OpenType and have all glyphs and tables needed by these scripts, but I still don't see the glyphs come out correctly. As not even *reordering* is done, I guess that my Uniscribe DLL does not support these scripts. Are they implemented in newer versions of Uniscribe? If yes, where can I get it? Thanks in advance for any help. -- Marco Cimarosti
[OT] The nice thing about standards...
Hallo everybody! I received this in the mail, and I thought it could be of interest for the Unicode mailing list: Aragonese - Lo geno d'as normas ye que aiga tantas entre ras que se puede eslexir. Asturian - Lo bono de les normes ye qu'hai munches onde escoyer. Basque - Arauen alderik onena da aukeratzeko horrenbeste izatea. Brazilian Portuguese - A coisa boa dos standards é a sua incrível variedade. Bresciano - El bl de gli standar l' che ghe n' tach. Breton - Ar pep gwella a-fed ar standardo eo o liested. Calabrese - La cosa bella tra li standard è ca ci ni su tanti ca putimu scegli. Catalan - El bo de les regles és que n'hi tantes entre les quals escollir. Croatian - Zgodno u standardima jest da ih je toliko mnogo na izbor. Dutch - Het leuke van normen is dat er eindeloos veel zijn om uit te kiezen. English - The nice thing about standards is that there are so many of them to choose from. Esperanto - Bona afero pri normoj estas ke ekzistas tiom da ili por elekti. Estonian - Normide juures on kõige toredam see, et neid on nii palju - ole mees ja vali! Finnish - Standardien hyvä puoli on siinä, että niissä on niin paljon valinnanvaraa. Flemish - Het leuke aan normen is dat je zo'n uitgebreid assortiment hebt om uit te kiezen. French - La plus grande qualité d'un standard est sa multiplicité. Friulian - La robe biele dai standard j che 'a son tancj par sielgiju. Galician - O bo das normas é que hai moitas para escoller. German - Das Beste an Standards ist ihre große Auswahl. Griko - To prama rrio attus standard ene ka echi tossu amsa is pu sozzi jaddhzzi. Hungarian - Az a jó dolog a szabványokban, hogy olyan sok közül lehet választani. Italian - La cosa bella degli standard è che ce ne sono tanti tra cui scegliere. Judeo-Spanish - Lo bueno de las normas es ke son tantas. Ke se puede eskojer entre eyas. Latin - Bonum rationum est quod plurimae sunt inter quas eligere possumus. Limburguese - Wo zoe hnnig s on norme s daste ter zovel van hbs vr aut te kieze. 
Modenese - Al bl di mod l' c'a gh 'na bla slta. Napolitan - 'O bbello r''e mudille è ca se ponno scegliere mmiez'a ttante. Papiamento - E kos bon di normanan ta ku tin hopi fo'i kua por skohe. Parmisan - Al bel di standard l' ch' a gh'n na mucha pr poder cernir. Piedmontese - Ln ch'a l' bel d ij stndard a l' ch'as peul serne antre vaire. Polish - Zaletą standardów jest to, że będąc licznymi, jest z czego wybierać. Portuguese - O bom das normas é que há moitas por onde escoller. Romagnolo - Quel che da gost t'le banl l' ch'u i un foto. Romano - Er bello delle regole è che ce ne so' 'n sacco tra cui pote' sceje. Sicilian - 'U bellu d' 'i standard è ca ci nn'è un mari ammenzu a cu' scegliri. Spanish - Lo bueno de las normas es que hay tantas entre las que se puede elegir. Swedish - Det trevliga med standarder är att det finns så många att välja på. Venetian - El belo dei 'standard' xe che ghen'è un saco par far na selta. (More here, of course in Unicode: http://www.verba-volant.net/pls/vvolant/datecorps2.dictionary?data=20-OCT-04 ) -- Marco
RE: UTF to unicode conversion
Mike Ayers wrote: Side 1 (print and cut out): ++---+---+--+ | U+ | yy zz |Cima's UTF-8 Magic | Hex= | | U+007F | ! ! |Pocket Encoder | B-4 | | YZ | . . | | | ++---+---+ Vers. 1.0 | 0=00 | | U+0080 | 3x xy | 2y zz | 16 March 2000 | 1=01 | [...] Holy Hermes Trismegistos, I had forgotten this one! How's it I had all that free time back in March 2000? :-) _ Marco
RE: lines 05-08, version 4.7 of Roadmap to BMP and 'Hebrew extensions'
Rick McGowan wrote: I mistakenly thought Tifinagh was rtl. That's OK. It has been, and sometimes still is, written right to left, hence it was roadmapped in a right-to-left allocation block. However, in modern usage, and in the Moroccan national standard now being drafted, it is specifically left to right. Is the draft of this Moroccan standard on-line somewhere? TIA. _ Marco
RE: Bob Bemer, father of ASCII, has died
[\]{}
RE: Latin long vowels
Anto'nio Martins-Tuva'lkin wrote: On 2004.06.22, 16:20, Marco Cimarosti wrote: You can also compose them with the normal letter followed by character MODIFIER LETTER MACRON (code 02C9, decimal 713). Oops! You mean U+0304 : COMBINING MACRON (decimal: 772). Yes, right, sorry. (Hey, that's why it didn't want to go on top of the bloody vowel! :-) _ Marco
RE: Latin long vowels
Joe Speroni wrote: I apologize for a simple question, but after a few hours of research I don't seem to be able to find the characters needed. Funny: I see them in my Windows Character Map utility at the first hit on the Page Down key... I'm trying to scan a Latin text that uses a bar over the vowels to indicate long sounds. Do these characters exist in Unicode? Uppercase: 0100, 0112, 012A, 014C, 016A (decimal: 256, 274, 298, 332, 362) Lowercase: 0101, 0113, 012B, 014D, 016B (decimal: 257, 275, 299, 333, 363) You can also compose them with the normal letter followed by character MODIFIER LETTER MACRON (code 02C9, decimal 713). If so, would anyone know from where a Windows XP font containing these five characters could be downloaded? Several fonts which come pre-installed in Windows NT, 2000 or XP have those characters, e.g. Arial, Times New Roman and Courier. Aloha, Aloha? Is it for Hawaiian that you need the macrons? In that case also notice that, afaik, the proper character for the glottal stop letter (aka apostrophe or okina) is MODIFIER LETTER TURNED COMMA (code 02BB, decimal 699). _ Marco
RE: [OT] Even viruses are now i18n!
Antoine Leca wrote: The virus cannot have any knowledge of a language code. And much less of the language used by its next victim... It sends e-mails to addresses stolen from the previous victim's address list, so it can analyze the top-level domain of these addresses (.it, .fr, etc.). Although, strictly speaking, these domains normally correspond to *country* codes, they are a pretty good hint of the language of the next victim. _ Marco
[OT] Even viruses are now i18n!
It seems that even the virus industry is getting global! F-Secure Virus Descriptions : NetSky.X [...] Netsky.X sends messages in several different languages: English, Swedish, Finnish, Polish, Norwegian, Portuguese, Italian, French, German and possibly the language of some small island called Turks and Caicos, located in the Atlantic Ocean. [...] According to whether the destination address is one of the following domains: [...] It will compose messages in the corresponding language, choosing from the following parts. Bodies chosen from: [...] mutlu etmek okumak belgili tanimlik belge. Behaga läsa dokumenten. Haluta kuulua dokumentoida. Podobac sie przeczytac ten udokumentowac. Behage lese dokumentet. Leia por favor o original. Legga prego il documento. Veuillez lire le document. Bitte lesen Sie das Dokument. Please read the document. (from http://www.f-secure.com/v-descs/netsky_x.shtml) _ Marco
RE: [OT] Even viruses are now i18n!
Peter Kirk wrote: mutlu etmek okumak belgili tanimlik belge. ... This is Turkish, of a sort. The virus writers have presumably confused .tc and .tr, as this Turkish is the first body listed and .tc is the first domain listed. Yes, and the translation was probably done by translating the English sentence "Please read the document" word by word. The English word "the" was translated as "belgili tanımlık", which actually means "definite article"! _ Marco
RE: help finding radical/stroke index at unicode.org
Gary P. Grosso wrote: Judging by what we saw in the back of the Unicode 2.0 book, we would tend to say that it is correct that (in an index) 21333 (0x5355) is sorting under 21313 (0x5341) instead of 20843 (0x516b). I am looking for some table of radicals that I can show our customer to help support that claim. Perhaps I should start by asking for opinions on the above sorting, and for guidelines on how best to govern such decisions, [...] As Ken Whistler said, you don't necessarily have to make such a decision. The usual policy in a dictionary-like radical/stroke index is to put ambiguous characters under *multiple* radicals, so that they are easily found whatever the reader's assumption. A computer radical/stroke search utility is supposed to be at least as user-friendly as old paper indices. Please notice that the Unihan.txt database contains most of the raw data you need to build such a comprehensive index. The data is contained in these fields: kRSUnicode kRSJapanese kRSKanWa kRSKangXi kRSKorean kRSUnicode is Unicode's default radical/stroke index (the one which was used to assign the code points to CJK characters), while the other ones are alternate radical/stroke indices from a variety of sources. E.g., for character U+5355 (单 = lone), Unihan.txt contains the following kRS... entries: U+5355 kRSKangXi 12.6 U+5355 kRSUnicode 24.6 These entries tell you that while Unicode puts U+5355 under the 24th radical (U+2F17 = U+5341 = 十 = ten), the Kang Xi Zidian dictionary puts it under the 12th radical (U+2F0B = U+516B = 八 = eight). Basically, if you extract all the kRS... fields, ignore the field identifier, sort them by their radical/stroke index, and discard duplicates (i.e., entries with the same index and code point), you obtain a list quite close to a dictionary-like index with ambiguous characters under multiple radicals. _ Marco
RE: Unicode 4.0.1 Released
Rick McGowan wrote: Unicode 4.0.1 has been released! [...] The main new features in Unicode 4.0.1 are the following: [...] 3. Unicode Character Database: [...] * Changed: general category of U+200B ZERO WIDTH SPACE * Changed: bidi class of several characters (If I am asking a FAQ, I apologize in advance...) So far, my understanding was that the normative properties of existing code points were carved in stone. Won't these fixes break applications out there? I.e., won't they turn previously conformant applications into non-conformant ones? _ Marco
RE: help needed with adding new character
Michael Everson wrote: What organization uses the ANARCHY SYMBOL? ;-) The anarchist movement. Why are you winking? Ciao. Marco
[OT] Freedom and organization (was RE: help needed with adding new character)
Kenneth Whistler wrote: Why is an Anarchist asking to standardize something? Why not!? Can you elaborate on this? Myself, I am an anarchist sympathizer, and I have been deeply interested in a character encoding standard for nearly ten years now... Anarchism is against imposed forms of organization, not against organization itself. And standards are quite like the useful side of laws (the organization) without the harmful side (the imposition), so they should be welcome to anarchists. Obedience to laws can only be imposed with violence by a parasitical clique (cops, tribunals), whereas compliance to standards is achieved solely by the intrinsic usefulness and quality of the standard's design. _ Marco
RE: help needed with adding new character
Jon Wilson wrote: I disagree that the anarchy symbol is not a character used in the representation of words. I can write a word beginning with A with either a simple LATIN CAPITAL LETTER A, or with an Anarchy symbol, or with an existing CIRCLED LATIN CAPITAL LETTER A. You can also write an M using the McDonald's logo (I have seen it in actual McDonald's advertising), or an I using a jumping table lamp (I have seen it in a cartoon by Pixar). But we don't need to allocate Unicode code points for the McDonald's logo or for a jumping table lamp, do we? Such things are not independent letters, but just graphic variations of the ordinary letters A, M, I, made with the purpose of adding some kind of overtone (ideological, commercial, humorous). I also disagree that the Anarchy symbol has no use within a text. I do not doubt that I can find examples of published texts where the anarchy symbol is used throughout. Beware of saying that it isn't real text just because the character isn't currently in Unicode. The code should represent usage, not the other way round. I understand that finding such text is probably crucial to a successful application. Yes, I think that this is THE very point that you have to demonstrate before filing your proposal. And I bet that the success of your proposal will depend almost entirely on how good this demonstration is. The point is not so much to demonstrate that the symbol exists (that's quite obvious: we've all seen it), or that it is unique (that's quite obvious too, IMHO: it is both graphically and semantically different from the current Unicode circled A): the point is to demonstrate that it is *TEXT*, i.e. that some piece of text could not be encoded without it. I am a subscriber of at least two anarchist magazines (printed on *paper*, so Unicode digital encodings are not at issue here), and I don't recall having ever seen an anarchy symbol used *within* the body text of these magazines. 
For sure, the symbol is ubiquitous *near* the text: it is used as a logo on the magazines' title or in third parties' advertising; it is used as typographic decoration; it appears on the flags in the photos of rallies and demonstrations... But, as far as I can recall, it is never used as part of a sentence. Of course, knowing about your proposal, I will look with doubled attention at all the next issues, and I will send you any specimen of the symbol used as text. But I am quite skeptical. _ Marco
RE: [OT] Freedom and organization (was RE: help needed with adding new character)
Peter Kirk wrote: Come to think of it, a not very large group of them with a bit of money behind them could buy enough votes to outvote the corporations and destroy Unicode - Yes, right, interesting possibility! Not that much money either: a single punk rock concert would probably raise enough funds. But, before I propose this to the World Wide Anarchist Consortium, could you please mention at least one valid motive for the action? They would object that, as anarchists are by nature anti-nationalistic and internationalist, anarchist on-line forums and web sites tend to be multi-language, and Unicode is quite a useful tool for that. _ Marco
RE: OT? Languages with letters that always take diacriticals
Curtis Clark wrote: Are there any languages that use letters with diacriticals, but *never* use the base letter without diacriticals? AFAIK, Thaana is such a case. Unlike Indic scripts, Thaana has no inherent vowel, so each consonant letter always takes either a vowel mark or the sukuun (= no-vowel mark). _ Marco
RE: Web Form: Other Question: Etruscan, Sanscrit, Linear B on ibook G4
John Jenkins wrote: Anybody understand what he means by there is unicode gamma of characters but it is not complete? I guess unicode gamma of characters is Italinglish for Unicode character set. (Italian gamma means repertoire, range, scale, set.) _ Marco
RE: Unicode forms for internal storage - BOCU-1 speed
Jon Hanna wrote: I refuse to rename my UTF-81920! Doug, Shlomi, there's a new one out there! Jon, would you mind describing it? _ Marco
RE: Detecting encoding in Plain text
Peter Kirk wrote: This one also looks dangerous. What do you mean by dangerous? This is a heuristic algorithm, so it is only supposed to work always but only in some lucky cases. If lucky cases average to, say, 20% or less, then it is a bad and useless algorithm; if they average to, say, 80% or more, then it is a good and useful one. But you can't ask that it works in 100% of cases, or it wouldn't be heuristic anymore. Some scripts include their own digits and punctuation; not all scripts use spaces; and controls are not necessarily used, if U+2028 LINE SEPARATOR is used for new lines. Yes, but *all* these circumstances must occur together in order for the algorithm to be totally useless for *that* language. If a certain Unicode plain-text file uses ASCII punctuation OR spaces OR end-of-line characters, AND the file is not too short and does not have a very odd formatting, then the algorithm should work. But there may be some characters U+??00 which are used rather commonly in a particular script and so occur commonly in some text files. And those text files will not be detected correctly, particularly if they are very short: that's part of the game. _ Marco
RE: Detecting encoding in Plain text
Jon Hanna wrote: False positives can be caused by the use of U+0000 (which is most often encoded as 0x00) which some applications do use in text files. I have never seen such a thing, can you give an example? I can't imagine any use for a NULL in a file apart from terminating records or strings but, of course, a file containing records or strings is not what I would call a plain-text file, anyway not a typical plain-text file. The method can be used reliably with text files that are guaranteed to contain large amounts of Latin-1 But the Latin-1 (or even just ASCII) range contains some characters which are shared by most languages (space, new line and/or line feed, digits, punctuation), so there should be a relatively large amount of Latin-1 characters in most cases. Even scripts which have their own digits or punctuation often prefer European digits and punctuation, especially in computer usage. E.g., it suffices to check a few websites (or even printed matter) in Arabic to see that European digits are much more widespread than native digits. _ Marco
RE: Detecting encoding in Plain text
Peter Kirk wrote: What do you mean by dangerous? This is a heuristic algorithm, so it is only supposed to work always [...] (I meant: it is not supposed to work always) I would not consider an 80% algorithm to be very good - depending on the circumstances etc. But if for example 20% of my incoming e-mails were detected with the wrong encoding and appeared on my screen as junk, [...] In this case (as in most other similar cases), you should rather blame the people who send you e-mail without an encoding declaration. Auto-detection should be the last resort, when you have no safer way of determining the encoding. Yes, but *all* these circumstances must occur together in order for the algorithm to be totally useless for *that* language. [...] True. But there may be certain languages (perhaps Thai?) for which all of these circumstances regularly occur together. I don't think that Thai would be such a case. Thai normally uses European digits (the usage scope of Thai digits is probably similar to that of Roman numerals in Western languages), some European punctuation (parentheses, exclamation marks, hyphens, quotes), and spaces (although a Thai space has the strength -- and hence the frequency -- of a Western semicolon). As a minimum, all languages should use line feed and/or new line as line terminators, as Unicode's line and paragraph separators never caught on. _ Marco
RE: Chinese rod numerals
Christopher Cullen wrote: (2) The Unicode home page says: The Unicode Standard defines codes for characters used in all the major languages [...] mathematical symbols, technical symbols, [...]. I suggest that in an enterprise so universal and cross-cultural as Unicode, the definition of what counts as a mathematical symbol has to be conditioned by actual mathematical practice in the culture whose script is being encoded. I think that Ken Whistler's point was simply this: OK, Chinese rod numerals may be symbols, but were these symbols used in *writing*? Not all symbols are used in writing, and only symbols used in writing are suitable to be part of a repertoire for, well, encoding symbols used in writing... A flag, a medal, a tattoo, a T-shirt may definitely be called symbols, yet Unicode does not need a code point for Union Jack or Che Guevara T-Shirt. To stick to mathematics, a pellet on an abacus, a key on an electronic calculator, or a curve drawn on a whiteboard may legitimately be considered symbols for numbers or other mathematical concepts. Yet, Unicode does not need a code point for abacus pellet, or memory recall key, or hyperbola with horizontal axis, because these symbols are not elements of writing. IMHO, in your proposal you should provide evidence that the answer to the above question is yes. I.e., you don't need to prove that these symbols were used in Chinese mathematics, but rather that they were used to *write* something (numbers, arguably, or arithmetical operations, etc.). _ Marco
RE: Detecting encoding in Plain text
Doug Ewell wrote: In UTF-16 practically any sequence of bytes is valid, and since you can't assume you know the language, you can't employ distribution statistics. Twelve years ago, when most text was not Unicode and all Unicode text was UTF-16, Microsoft documentation suggested a heuristic of checking every other byte to see if it was zero, which of course would only work for Latin-1 text encoded in UTF-16. I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that this method was suggested first by Microsoft: to me, it seems quite self-evident. It is extremely unlikely that a text file encoded in any single- or multi-byte encoding (including UTF-8) would contain a zero byte, so the presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or UTF-32. The next step is distinguishing between UTF-16 and UTF-32. A bullet-proof negative heuristic for UTF-32 is that a text file *cannot* be UTF-32 unless at least 1/4 of its bytes are zero. A positive heuristic for UTF-32 is detecting sequences of two consecutive zero bytes, the first of which has an odd index: as it is very unlikely that a UTF-16 file would contain a NULL character, zero 16-bit words must be part of a UTF-32 character. The combination of these two methods is quite enough to tell apart UTF-16 and UTF-32. Once you have determined whether the file is in UTF-16 or in UTF-32, a statistical analysis of the *indexes* of zero bytes should be quite enough to determine the UTF's endianness. UTF-16 is likely to be little-endian if zero bytes are more frequent at odd indexes than at even indexes, and vice versa. This is due to the fact that, in any language, shared characters in the Latin-1 range (controls, space, digits, punctuation, etc.), whose high byte is zero, should be more frequent than occasional code points of form U+??00. 
For UTF-32, determining endianness is even simpler: if *all* bytes whose index is divisible by 4 are zero, then it is big-endian (that byte is the always-zero most significant one), else it is little-endian. Of course, all this works only if the basic assumption that the file is a plain-text file is true: this method is not quite enough for telling apart text files from binary files. _ Marco
RE: Punched tape (was: Re: American English translation of character names)
Anto'nio Martins-Tuva'lkin wrote: |O OoOO | |O oOOO | | OOo O O| |OO oOO | | O o OO| |OO o| |O Oo OO | |O o| | OOo OOO| |O OoO | | OOo OOO| |O o OOO| «\N5l#oVO7X7G»? _ Marco
RE: Unicode-ASCII approximate conversion
Hallvard B Furuseth wrote: I need a function which converts Latin Unicode characters to the closest equivalent ASCII characters, e.g. é -> e. Before I reinvent the wheel, does any public domain or GPL code for this already exist? I don't know, sorry. If not, for the most part I expect I can make the mapping from the character names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE' in ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt. Why the name!? The decomposition property (the 6th field on each line) is much better for this. E.g.: 00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9 The decomposition field tells you that é (code 00E9 hex) is composed of ASCII e (code 0065 hex) and the combining acute accent (code 0301 hex): you keep the ASCII character and drop the combining accent. Punctuation and other non-letters will be worse, but they are less important to me anyway. The result is much better if you allow the ASCII conversion to be a string. This allows you to map, e.g., © = (c), ½ = 1/2, and so on. This is also good for letters: ß = ss, å = aa, etc. _ Marco
RE: American English translation of character names
John Cowan wrote: In the New York City subway system (of underground trains, that is, not underground pedestrian tunnels!), this letter has been consistently avoided since 1967, when the system of distinguishing trains by letter or number was instituted. The only other letters never used are I and O (presumably to avoid confusion with 1 and 0, though 0 has never been used either), and Y. Why Y is a mystery to me: perhaps there has simply never been a need for it. Probably, having to get train Why? to reach one's workplace could have a negative effect on employees' attitude towards hard working. _ Marco
RE: [OT] CJK - CJC (Re: Corea?)
Doug Ewell wrote: I'll go farther than that. It's always bothered me that speakers of European languages, including English but especially French, have seen fit to rename the cities and internal subdivisions of other countries. Rightly said! There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. _ Marco
RE: [OT] CJK - CJC (Re: Corea?)
Michael Everson wrote: At 11:04 +0100 2003-12-17, Marco Cimarosti wrote: There is reason to rename Colonia to Köln, Augusta to Augsburg, Eboraco to York, Provincia to Provence, and so on. Nicely said. Subtle irony tends to go over some people's heads on this list though. Especially if one forgets an essential no. :-( It should have been There is NO reason to rename... Eboraco is called Eabhrac in Irish. :-) So, that's who set the bad example in the first place! When the Angles came they said: if Britanni can mangle place names, why shouldn't Ingevones? :-) Ciao. Marco
RE: Arabic Presentation Forms-A
Philippe Verdy wrote: #code;cc;nfd;nfkdFolded; # CHAR?; NFD?; NFKDFOLDED?; # RIAL SIGN fdfc;;;isolated 0631 06cc 0627 0644; # ??; ?; ?; The Arial Unicode MS font does not have a glyph for the Rial currency sign so I won't comment lots about it, even if it's a special ligature of its component letters: - where the medial form of U+06CC ARABIC LETTER FARSI YEH (?) is shown on charts only as two dots (and not with its Arabic letter alef maksura base form, as the comment in the Arabic chart suggests for Arabic letter yeh), which is I am not sure I understand what you are asking, but it is quite normal that the initial and medial forms of the letters Beh, Teh, Theh, Noon and Yeh lose their tooth and are thus recognizable only by their dots. Similarly, Seen and Sheen often lose their three teeth. I find this particularly puzzling with the initial and medial forms of Seen, which becomes a simple straight line in most calligraphic styles. - located on below-left of the medial form of U+0627 (?) , U+0627 is Alif, so it has no medial form. - and where the initial form of U+0631 (?) kerns below its next two characters (sometimes with an additional kashida below its next three characters). This too is quite normal: the tail of Reh, Zain and Waw often kerns below the next letter. Compare it to Latin lowercase j, which has a similar behavior. _ Marco
RE: [OT] CJK - CJC (Re: Corea?)
Doug Ewell wrote: This seems very misguided, if true. Alphabetical primacy can hardly be considered an effective measure of the relative power or importance of a nation. [...] Remember that in the time frame in question, the late '30s and early '40s, three of the major world powers were the United States, the United Kingdom, and the Soviet Union. These countries, beginning with U, U, and S in their respective national languages, were unlikely to attach much significance to the relative alphabetical order of Japan and Korea. Right. And what about the Chinese who, back in the 1950's, decided to use the digraph zh for the first consonant of the country's name? It seems that they didn't care too much that Zhongguo would have been last, after Zimbabwe. On the other hand, both Choson and Han'guk come before Nippon so, if the goal is to have the Koreas listed before Japan on the score boards of the Olympic games, it is enough to use the local names. But, BTW, aren't score boards sorted by score rather than by name? So, an even simpler solution is... winning more medals. _ Marco
[OT?] The C standard library and UTF's (was RE: Text Editors and Canonical Equivalence (was Coloured diacritics))
Tim Greenwood wrote: In my interpretation of the C standard (which I am reading from http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n843.pdf) UTF-8 is not a valid wchar_t encoding if your execution character set contains characters outside the C0 controls and Basic Latin range, and UTF-16 is not a valid wchar_t encoding if your execution character set has characters outside the BMP. In other words, whatever you consider to be a character (which may be a combining character) must be encoded in one wchar_t code unit. The relevant passage is: 11 A wide character constant has type wchar_t, an integer type defined in the <stddef.h> header. The value of a wide character constant containing a single multibyte character that maps to a member of the extended execution character set is the wide character (code) corresponding to that multibyte character, as defined by the mbtowc function, with an implementation-defined current locale. The value of a wide character constant containing more than one multibyte character, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined. I don't know. I thought a bit about this, and I think that your restrictive interpretation is not necessarily correct. After all, the C Standard just says that a wide character and a multibyte character are whatever the mbtowc function defines them to be. 
And it is quite easy to show that the mbtowc function could, in turn, define them to be whatever the mbrtowc function defines them to be:

// My hypothetical mbtowc.c
#include <wchar.h>

// (See ISO/IEC 9899:1999 - 7.20.7.2 The mbtowc function)
int mbtowc (wchar_t * pwc, const char * s, size_t n)
{
    int retval;
    static mbstate_t internal;
    if (s == NULL)
    {
        // yes: we are stateful (or pretend we are)
        return 1;
    }
    retval = (int)mbrtowc(pwc, s, n, &internal);
    if (retval < 0)
    {
        retval = -1;
    }
    return retval;
}

As the definition of multibyte characters and wide characters is now completely up to mbrtowc, we could well adopt the convention (or call it a trick, if you prefer) of pretending that a 4-byte UTF-8 multibyte sequence is actually a sequence of *two* 2-byte multibyte sequences. Technically, the trick is possible because: a) returning 2 twice instead of 4 once guarantees the correct advance while scanning a string; b) we can actually map both our fake 2-byte multibyte sequences to an actual wide character: the high and low surrogates; c) the mbstate_t object can be used to store the relevant data across the two calls. Legally, the trick is possible because of the purposely vague wording of the C Standard, which leaves the definition of wide and multibyte characters completely up to the implementation. Here is what I mean:

// Excerpt from my hypothetical wchar.h for UTF-16 wide characters
// ...
// (See ISO/IEC 9899:1999 - 7.17 Common definitions <stddef.h>)
typedef short wchar_t;
// ...
// (See ISO/IEC 9899:1999 - 7.24 Extended multibyte and wide character utilities <wchar.h>)
typedef wchar_t mbstate_t;
// ... 
// My hypothetical mbrtowc.c for UTF-16 wide characters
#include <wchar.h>

// (See ISO/IEC 9899:1999 - 7.24.6.3.2 The mbrtowc function)
size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
{
    extern int _MyDecodeUtf8 (const char * s, size_t n, long * c32);
    extern void _MyEncodeUtf16 (long c32, wchar_t * hi16, wchar_t * lo16);
    static mbstate_t internal = 0;
    long c32;
    int retval;
    if (ps == NULL)
    {
        ps = &internal;
    }
    if (s == NULL)
    {
        pwc = NULL;
        s = "";
        n = 1;
    }
    if (*ps != 0)
    {
        if (pwc != NULL)
        {
            // output second surrogate saved in previous call
            *pwc = *ps;
        }
        // clear saved surrogate
        *ps = 0;
        // return fake multibyte length
        return 2;
    }
    retval = _MyDecodeUtf8(s, n, &c32);
    if (retval == 4)
    {
        // output first surrogate and save second surrogate for next call
        _MyEncodeUtf16(c32, pwc, ps);
        // return fake multibyte length
        retval = 2;
    }
    else if (retval >= 0 && pwc != NULL)
    {
        *pwc = (wchar_t)c32;
    }
    return retval;
}

If the above UTF-16 implementation could perhaps look relatively smart, a UTF-8 implementation would definitely look very silly. However, if we agree that defining what a multibyte character and a wide character are is the exclusive task of mbtowc (and hence of mbrtowc), then the below implementation, silly as it is, could well be 100% compliant with C99:

// Excerpt from my hypothetical wchar.h for UTF-8 (or DBCS, or SBCS, or any byte-oriented encoding) wide characters
// ...
// (See ISO/IEC 9899:1999 -
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):

int n = wcslen(L"café");

(That's int n = wcslen(L"café"); for those without HTML email.) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that wide character means Unicode character, but let's just assume that it does, for the moment.)

Even assuming that you can assume that wide characters are Unicode, you have not yet assumed in what kind of UTF they are. (Don't assume I am deliberately making calembours :-) The only thing that the C(++) standards say about type wchar_t is that it is not smaller than type char, so a wide character could well be a byte, and a wide character string could well be UTF-8, or even ASCII.

So, should n equal four or five? Why not six? If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6.

The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. The answer is: int n = wcslen(L"café"); That's why you take the burden of calling the wcslen library function rather than assuming a hard-coded value such as:

int n = 4; // the length of string "café"

There is more to consider than just how and whether a text editor normalizes. Whatever the editor does, what if the *compiler* then normalizes it? The source file and the compiled object file are not necessarily in the same encoding and/or normalization. A certain compiler could accept a certain range of input encodings (maybe declared with a command-line parameter) and convert them all into a certain internal representation in the compiled object file (e.g., Unicode expressed in a particular UTF and with a particular normalization). That's why library functions such as strlen or wcslen exist.
You don't need to bother about what these functions will return in a particular compiler or environment, as long as the following code is guaranteed to work:

const wchar_t * myText = L"café";
wchar_t * myBuffer = malloc(sizeof(wchar_t) * (wcslen(myText) + 1));
if (myBuffer != NULL) {
    wcscpy(myBuffer, myText);
}

If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. Again, this is neither possible nor desirable, because a text editor is not supposed to know how the compiler (or its runtime libraries) will transform string literals.

The question I posed in the previous paragraph should ideally be obvious by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. Provided that you can define what a character is... After a few years reading this mailing list, I haven't seen a single acceptable definition of character. Moreover, I have formed the impression that it is totally irrelevant to have such a definition:

- as an end user, I am interested in a higher level kind of objects (let's call them graphemes, i.e. those things I see on the screen and can interact with using my mouse);

- as a programmer, I am interested in a lower level kind of objects (let's call them encoding units, i.e. those things that I count when I have to allocate memory for a string, or the like).

The term character is in a sort of conceptual limbo which makes it pretty useless for everybody, IMHO. _ Marco
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
I (Marco Cimarosti) wrote: So, should n equal four or five? Why not six? ^^^ Errata: seven. If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6. ^ Errata: 7. Sorry. _ Marco
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk wrote: So, should n equal four or five? The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input.

Standards and fantasy are both good things, provided you don't mix them up. The wcslen function has nothing whatsoever to do with the Unicode standard, but it has everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop.

(One can imagine a second parameter specifying whether NFC or NFD is required.) One can imagine whatever (s)he wants, but should please avoid claiming that his/her imagination corresponds to some existing standard.

This makes the issue one not for the text editor but for the programming language or its string handling library. This is correct.

The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible. Can you please cite the passage where the Unicode standard would not allow this? _ Marco
RE: [OT]
[...] some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!) Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-) Or simply flush Guinne$$ and drink Murphix. :-) Ciao. Marco
[OT] GB 18030 certification
I was wondering: what exactly does GB-18030 certification consist of? I guess that some tests are done on the software, but what exactly? Also, where and who performs this certification? Does the Chinese government do it directly, or is it out-sourced to external agencies? Does this have to be done in China, or are offices available abroad as well? Thanks in advance for any info. Please reply off-line if you think the matter is not of general interest for the list. _ Marco
RE: Problems encoding the spanish o
Pim Blokland wrote: Not only that, but the process making the mistake of thinking it is UTF-8 also makes the mistake of not generating an error for encountering malformed byte sequences, BTW, this process has a name: Internet Explorer. AND of outputting the result as two 16-bit numbers instead of one 21-bit number. I guess that this resulted from copying and pasting the resulting text into an editor and saving it as UTF-16. _ Marco
RE: Tamil conjunct consonants (was: Encoding Tamil SRI)
Peter Jacobi wrote: IMHO this doesn't fit actual Tamil use well and raises a lot of practical problems. Either there must be an accepted list of these ligatures (but lists of archaic usage tend to grow), or one is bound to put a preemptive ZWNJ after every SHA VIRAMA in modern use, to prevent conjunct consonant forming. If this archaic ligature problem extends to other grantha consonants, even more preemptive ZWNJs are necessary for contemporary Tamil. Archaic ligatures are supposed to be present only in a font designed for reproducing an archaic look. Those fonts should not be used for typesetting modern Tamil. There is nothing special about Tamil here: this would be true for any other script. E.g., if you typeset this English e-mail with a Fraktur OpenType font, many archaic ligatures might appear, such as ch or ss. Moreover, unexpected contextual forms could appear: e.g., the s in special could look very different from the s in ligatures (long s vs. short s). ZWNJ's etc. should be inserted only in special cases, e.g. when the presence or absence of a ligature would change the meaning of the word, or anyway affect the meaning of the text. _ Marco
RE: Encoding Tamil SRI
Peter Constable wrote: Alternatives given were (0BB8)(0BCD)(0BB1)(0BC0) (0BB6)(0BCD)(0BB1)(0BC0) (if and when U+0BB6 becomes Unicode) (0B9A)(0BBF)(0BB1)(0BC0) Alternatives to what? The first and third sequence would have distinct appearances (see attached file), and would constitute distinct spellings. The second cannot be evaluated without knowing what they intend 0BB6 to be. U+0BB6 = TAMIL LETTER SHA (see http://www.unicode.org/alloc/Pipeline.html). _ Marco
Re-distributing the files in http://www.unicode.org/Public/MAPPINGS/VENDORS
Is it allowed to re-distribute, in a commercial application, the Microsoft and Apple mapping files available from the Unicode server? I am talking about the files published in the following directories: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE The header comments of mapping files in other directories under http://www.unicode.org/Public/MAPPINGS often contain explicit instructions for re-distribution, such as: Unicode, Inc. hereby grants the right to [...] make copies of this file in any form for internal or external distribution as long as this notice remains attached; Unicode, Inc. specifically excludes the right to re-distribute this file directly to third parties or other organizations whether for profit or not. It would be nice to have similar statements also from Microsoft and Apple, either in the header of the mapping files themselves or as separate read-me files in the relevant directories. If this is not possible, could I please have a public or private reply from the contact e-mails indicated in the files ([EMAIL PROTECTED] and [EMAIL PROTECTED], respectively), or from a Unicode Consortium official, stating whether or not I am allowed to re-distribute the above described files in a commercial application? Thank you in advance. Regards. Marco Cimarosti (S3, Italy, http://www.essetre.it)
RE: GDP by language
Mark Davis wrote: Marco, I certainly wouldn't draw that conclusion. This is not the appropriate forum for a political or ethical discussion, Of course. I just noticed that those numbers reflect a sad fact of life: that rich people get more than poor people. As this fact is so obvious to anyone, I thought that my remark would not have caused a long discussion. but equating GDP with "more important" in any general sense is clearly a huge leap, and one that I certainly would not make. But there certainly is a correlation between GDP and what people can buy, including software. The goal of the chart was different. Many people mistakenly think the potential customer base of non-English-speakers is smaller than it actually is. Ah, I didn't imagine it from this point of view. For people who live in non-English-speaking countries, it is easier to remember that English is not the only language in the world. I thought the chart was intended as a rationale for prioritizing the support of languages in consideration of the profitability of the corresponding markets: 1. support for Western languages is priority one, as it corresponds to the largest slice of market; 2. CJK support comes immediately after, as it corresponds to the second largest market; 3. then comes Bidi support, which corresponds to a smaller but still interesting market; 4. Indic support can wait, as the corresponding market is less profitable. This is, IMHO, how the people paying our salaries would read the chart. I am not even blaming them, as that is probably the correct reading, from the point of view of business. _ Marco
RE: GDP by language
Mark Davis wrote: BTW, some time ago I had generated a pie chart of world GDP divided up by language. Those quotients are immoral. Of course, this immorality is not the fault of whoever did the calculation: the immorality is out there, and those infamous numbers are just an arithmetical expression of it. In practice, those quotients say that, e.g., Italian (spoken by 50 million people or less) is more important than Hindi (spoken by nearly one billion people), just because the average Italian is richer than the average Indian. In other terms, each Indian (or any other citizen of a poor country) has 1/20 or less of the linguistic rights of an Italian (or any other citizen of a rich country). BTW, by summing up languages written with the same script, it is easy to derive the immoral quotients of writing systems:

Latin       59.13%
Han         20.60%
Arabic       3.82%
Cyrillic     2.99%
Devanagari   2.54%
Hangul       1.84%
Thai         0.87%
Bengali      0.44%
Telugu       0.42%
Greek        0.40%
Tamil        0.34%
Gujarati     0.26%

_ Marco
RE: Line Separator and Paragraph Separator
Jill Ramonsky wrote: [...] I've even invented (and used) some 8-bit encodings which leave the whole of Latin-1 unchanged (apart from the C1s) and use C1 characters a bit like surrogate pairs to reach the rest. Doug, are you listening? It seems there's a new clone of UTF:-)Z waiting for implementation! _ Marco
RE: Swahili Banthu
Peter Kirk wrote: Are we talking about a real non-Latin script, some kind of syllabary or logographic script, for Swahili and other Bantu languages? [...] Or did someone not notice that Marco's comments were about the word joke? Indeed. In the last few months, I have been relatively serious, so someone may not know or remember that I am the unofficial Unicode List's clown. *|:o) _ Marco
RE: Swahili Banthu
Philippe Verdy wrote: As Africa has been influenced by many foreign invasions, there may in fact exist other scripts to represent this language [...] Yes: until a recent past, Swahili was also commonly written in the Arabic alphabet. _ Marco
RE: PUA
Chris Jacobs wrote: [...] Nevertheless I think if Unicode doesn't want to decide how the PUA is to be interpreted Please take notice of this interpreted: I'll come back to this soon. it should at the very least provide a mechanism by which a user of the PUA can specify which specification he prefers. I plan to propose such a mechanism: I want to propose a char with the following properties: Scalar Value: U+E0002 This starts a PUA interpretation Again, please take notice of this interpretation. selector tag. The content of the tag is a Font family name. For all PUA chars between this tag and the corresponding Cancel tag the copyright holder of the font is the sole authority about how the PUA should be interpreted. Again, interpreted... Any comments? Yes. A font tells me how a certain run of text should be *displayed* in rich text, not how it should be *interpreted* in plain text. Imagine that I have been asked to write a function AreTheseLetters() which gets a string argument (i.e., a piece of plain text) and returns a Boolean value indicating whether all the characters in it are letters. For non-PUA characters, I already implemented this using Unicode's General Category property: I decided that all characters whose General Category is L* are letters. My default assumption about PUA characters is that they are not letters. So far so good. Now I want to use your PUA Plan-14 tags, if present, to override the above assumption about PUA characters. E.g., imagine that my string contains this: (U+0E U+0E0002 U+0E0046 U+0E006F U+0E004F U+0E0062 U+0E0061 U+0E0072 U+0E002E U+0E0074 U+0E0074 U+0E0066 U+0E007F U+E017 U+E009) This is what I am going to do: 1) I parse the tags at the beginning of the string and save the relevant information in a temporary variable which we will call PuaInterpretation; 2) I remove the tags.
Now, my PuaInterpretation variable contains the following information: Foobar.ttf And my string contains the following text: (U+E017 U+E009) Now, what's the next step? What am I supposed to do to find out whether, according to the PUA interpretation called Foobar.ttf, U+E017 and U+E009 are letters or not? _ Marco
RE: Klingons and their allies - Beyond 17 planes
John Cowan wrote: You persist in misunderstanding. Suppose I came along and told you I wanted to create a Unicode codepoint for each word in every language on Earth. Would you blithely allocate me a 24-billion-codepoint private space? Why? 200 million should be more than enough: that's more than 30,000 words for each living language. Of course, you should only encode abstract words, such as ENGLISH VERB JOKE, and combining morphemes such as ENGLISH COMBINING INFLECTION PAST TENSE, ENGLISH COMBINING INFLECTION PRESENT PARTICIPLE, etc. It will be the task of the uttering engine to utter a sequence like ENGLISH VERB JOKE + ENGLISH COMBINING INFLECTION PRESENT PARTICIPLE with the ligature "joking". Of course, this will only happen with OpenLex-enabled uttering engines: a naive uttering engine based on old TrueLex would render it with the fallback uttering "joke -ing". To make it more interesting, you could also encode a few useless compatibility presentation inflected forms such as ENGLISH VERB SPEAK PAST TENSE FORM, which will get decomposed to ENGLISH VERB SPEAK + ENGLISH COMBINING INFLECTION PAST TENSE, and finally be rendered as "spoke", "speaked" or "speak -ed", depending on the platform. Notice that a few words will need contextual forms, such as ENGLISH INDETERMINATE ARTICLE, which will display as "a" or "an" depending on the following code point. Languages, such as Swahili, which use prefixes instead of suffixes will be encoded in logical order, i.e. with the combining prefix after the root. It will be the task of the uttering engine to reorder the prefix. E.g., the Swahili word "watu" (plural of "mtu" = man) will be encoded as SWAHILI NOUN TU + SWAHILI COMBINING INFLECTION PLURAL FOR PEOPLE and, in theory, it will be rendered as "watu". In practice, it will always be rendered as "-tu wa-" because no one will invest in implementing Swahili rendering. Ciao. Marco
RE: Canonical equivalence in rendering: mandatory or recommended?
Jill Ramonsky wrote: In my experience, there is a performance hit. I had to write an API for my employer last year to handle some aspects of Unicode. We normalised everything to NFD, not NFC (but that's easier, not harder). Nonetheless, all the string handling routines were not allowed to assume that the input was in NFD, but they had to guarantee that the output was. These routines, therefore, had to do a convert to NFD on every input, even if the input were already in NFD. This did have a significant performance hit, since we were handling (Unicode) strings throughout the app. I think that next time I write a similar API, I will deal with (string+bool) pairs, instead of plain strings, with the bool meaning already normalised. This would definitely speed things up. Of course, for any strings coming in from outside, I'd still have to assume they were not normalised, just in case. You could have split the NFD process into two separate steps: 1) decomposition per se; 2) reordering of combining classes. You could have performed step 1 (which is presumably much heavier than 2) only on strings coming from outside, and step 2 at every passage. In a further enhancement, step 2 could be called only upon operations which could produce non-canonical order: e.g. when concatenating strings but not when trimming them. To gain even more speed, you could implement an ad-hoc version of step 2 which only operates on out-of-order characters adjacent to a specified location in the string (e.g., the joining point of a concatenation operation). Just my 0.02 euros. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam Sengupta wrote: --- Marco Cimarosti wrote: OK but, then, your ZWJ becomes exactly what Unicode's VIRAMA has always been: [...] You are absolutely right. I am suggesting that the language-specific viramas be retained as script-specific *explicit* viramas that never disappear. In addition, let's have a script-specific ZWJ which behaves in the way you describe in the preceding paragraph. Good, good. We are making small steps forward. What you are really asking for is that each Indic script have *two* viramas: - a soft virama, which is normally invisible and only displays visibly in special cases (no ligatures for that cluster); - a hard virama (or explicit virama, as you correctly called it), which always displays as such and never ligates with adjacent characters. Let's assume that it would be handy to assign these two viramas to different keys on the keyboard. Or, even better, let's assign the soft virama to the plain key and the hard virama to the SHIFT key, OK? To avoid misunderstandings with the term virama, let's label this key JOINER. Now, this is what you *already* have in Unicode! On our hypothetical Bangla keyboard: - the soft virama (the plain JOINER key) is Unicode's BENGALI SIGN VIRAMA; - the hard virama (the SHIFT+JOINER key) is Unicode's BENGALI SIGN VIRAMA+ZWNJ. Not only does Unicode allow all of the above, but it also has a third kind of virama, which may or may not be useful in Bangla but is certainly useful in Devanagari and Gujarati: - the half-consonant virama (let's assign it to the ALT+JOINER key in our hypothetical keyboard), which forces the preceding consonant to be displayed as a half consonant, if possible. This is Unicode's BENGALI SIGN VIRAMA+ZWJ. Notice that, once you have these three viramas on your keyboard, you don't need keys for ZWJ and ZWNJ, as their only use, in Indic, is after a xxx SIGN VIRAMA.
Apart from the fact that two of the three viramas are encoded as a *pair* of code points, how does the *current* Unicode model impede you from implementing the clean theoretical model that you have in mind? [...] - independent and dependent vowels were the same characters; [...] I agree with you on all of these issues. You have in fact summed up my critique of the ISCII/Unicode model. OK. But are you sure that this critique should necessarily be directed at the *encoding* model, rather than at some other part of the chain? I'll now try to demonstrate how the redundancy of dependent/independent vowels may also be solved at the *keyboard* level. You are certainly aware that some national keyboards have so-called dead keys. A dead key is a key which does not immediately send (a) character(s) to the application but waits for a second key; in European keyboards dead keys are used to type accented letters. E.g., let's see how accented letters are typed on the Spanish keyboard (which, BTW, is by far the best designed keyboard in Western Europe): 1. If you press the ´ key, nothing is sent to the application, but the keystroke is memorized by the keyboard driver. 2. If you now press one of the a, e, i, o, u or y keys, characters á, é, í, ó, ú or ý are sent to the application. 3. If you press the space bar, character ´ itself is sent to the application. 4. If you press any other key, e.g. m, the two characters ´ and m are sent to the application in this order. Now, in the description above substitute: - the ´ key with 0985 BENGALI LETTER A (but let's label it VIRTUAL CONSONANT); - the a ... y keys with 09BE BENGALI VOWEL SIGN AA ... 09CC BENGALI VOWEL SIGN AU; - the á ... ý characters with 0986 BENGALI LETTER AA ... 0994 BENGALI LETTER AU. What you have is a Bangla keyboard where dependent vowels are typed with a single vowel keystroke, and independent vowels are typed with the sequence VIRTUAL CONSONANT+vowel. Do you prefer your cons+VIRAMA+vowel model?
Personally, I find it suboptimal, as it requires, on average, more keystrokes. However, if that's what you want, in the Spanish keyboard description above substitute: - the ´ key with the unshifted JOINER (= virama) key that we have already defined above; - the a ... y keys with 0986 BENGALI LETTER AA ... 0994 BENGALI LETTER AU; - the á ... ý characters with 09BE BENGALI VOWEL SIGN AA ... 09CC BENGALI VOWEL SIGN AU. Now you have a Bangla keyboard where independent vowels are typed with a single keystroke, and dependent vowels are typed with the sequence JOINER+vowel. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam Sengupta wrote: Is there any reason (apart from trying to be ISCII-conformant) why the Bangla word /ki/ what cannot be encoded as [KA][ZWJ][I]? Do we really need combining forms of vowels to encode Indian scripts? Perhaps you are right that it *would* have been a cleaner design to have only one set of vowels. But notice that [KA][ZWJ][I] is one character longer than [KA][I]. Maybe storage space is not a big problem these days, but it still makes 2 to 4 extra bytes for each consonant not followed by the inherent vowel /a/. Perhaps it *would* have been better to have only the combining vowels, and to form independent vowels with a mute consonant (actually, the independent vowel a). Also, why not use [CONS][ZWJ][CONS] instead of [CONS][VIRAMA][CONS]? One could then use [VIRAMA] only where it is explicit/visible. OK. But what happens when the font has no glyph for the ligature cons+ZWJ+cons, nor for the half consonant cons+ZWJ, nor for the subjoined consonant ZWJ+cons? As ZWJ, per se, is an invisible character, what happens is that your string displays as cons cons, which is clearly semantically incorrect. If you want the explicit virama to be visible, you need to encode it as cons+VIRAMA+cons. And this means that you (the author of the text) are forced to choose between ZWJ and VIRAMA based on the availability of glyphs in the *particular* font that you are using while typing. And this is a big no no no, because it would prevent you from changing the font without re-typing part of the text. What happens with the current Unicode scheme is that, if the font does not have a glyph for the ligature cons+VIRAMA+cons, nor for the half consonant cons+VIRAMA, nor for the subjoined consonant VIRAMA+cons, the virama is *automatically* displayed visibly, so that the semantics of the text is always safe, even if rendered with the most stupid of fonts. Surely, [A/E][ZWJ][Y][ZWJ][AA] is more natural and intuitively acceptable than any encoding in which a vowel is followed by a [VIRAMA]? Maybe.
But I see no reason why being natural or intuitive should be seen as a key feature for an encoding system. That might be the case for an encoding system designed to be used by humans, but Unicode is designed to be used by computers, so I don't see the problem. I assume that in a well designed Bengali input method, yaphala would be a key of its own, so, from the point of view of the user, it is just a character: they don't need to know that when they press that key the sequence of codes VIRAMA+YA will actually be inserted, so they won't notice the apparent nonsense of the sequence vowel+VIRAMA and, as we say in Italy, if the eye doesn't see, the heart doesn't hurt. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Peter Kirk wrote: I don't understand the specific issues here... But it does seem a rather strange design principle that we should expect a text to be displayed meaningfully even when the font lacks the glyphs required for proper display. The fact is that these glyphs are not necessarily *required*. Each Indic script has a relatively small set of glyphs that are absolutely required in any font, but also an unspecified number of ligatures that may or may not be present. This may depend on the language (e.g., Devanagari for Sanskrit typically uses more ligatures than Devanagari for Hindi), or even simply be a matter of typographical style. _ Marco
RE: Bangla: [ZWJ], [VIRAMA] and CV sequences
Gautam Sengupta wrote: I am no programmer, but surely the rendering engine could be tweaked to display a halant/hashant in the aforementioned situations? I understand that it won't happen *automatically* if we were to use ZWJ instead of VIRAMA. But if you were to take the trouble to do the tweaking, you'd then have completely *intuitive* encodings for vowel yaphala sequences, vowel+ZWJ+Y, instead of oddities like vowel+VIRAMA+Y. OK but, then, your ZWJ becomes exactly what Unicode's VIRAMA has always been: a character that is normally invisible, because it merges into a ligature with adjacent characters, but occasionally becomes visible when a font does not have a glyph for that combination. But there is one detail which makes your approach much more complicated: what we have been calling VIRAMA is *not* a single character. Every Indic script has its own: DEVANAGARI SIGN VIRAMA, BENGALI SIGN VIRAMA, and so on. Each one of these characters, when displayed visibly, has a distinct glyph: a Bangla hashant is a small / under the letter, a Tamil virama is a dot over the letter, etc. With your approach, the single character ZWJ is overloaded with a dozen different glyphs depending on which script the adjacent letters belong to. Plus, it still has to be invisible when used in a non-Indic script, such as Arabic. Implementing all this is certainly possible, but would result in bigger look-up tables, for no advantage at all. Perhaps there isn't a *problem* as such, and perhaps naturalness and intuitive acceptability aren't *key* features of the system, but surely, other factors being equal, they ought to be taken into consideration in choosing one method of encoding over another? Yes. But the flaws that I see in the ISCII/Unicode model are much smaller than you imply.
E.g., I agree that it would have been more logical if: - independent and dependent vowels were the same characters; - each script were encoded in its natural alphabetical order; - there were no precomposed and decomposed alternatives for the same graphemes. And others, on which perhaps a linguist won't agree, but which would have made life much easier for programmers: - all vowels were encoded in visual order, so that no vowel reordering would be necessary; - repha ra were encoded as a separate character, so that no reordering at all would be necessary. But, all summed up, living with these little flaws is *much* simpler than trying to change the rules of a standard a dozen years after people started implementing it. _ Marco
Bogus UTF's are back! :-) (was RE: Non-ascii string processing?)
Doug Ewell wrote: [...] we'd all use UTF-336. Er? If only I had a bit more spare time, Jill. You do NOT want to get me started... :-) Go for it, Doug! :-) If only I had a bit of spare time myself, I'd be eager to run bits-per-character statistics for UTF:-)336 in various languages... Something makes me think that it would rank even worse than UTF-32 and UTF:-)64. BTW, how about the convention of using the UTF:-) prefix for our bogus/model/parody UTF's? _ Marco
RE: Unicode Public Review Issues update
Jony Rosenne wrote: I don't remember whether Hebrew Braille is written RTL or LTR. Braille is always LTR, even for Hebrew and Arabic. To be more precise, Braille is always LTR when you read it, but RTL when you write it manually (because it is engraved on the back side of the paper, using a dotted rule and a stylus). _ Marco
Braille is not bidi neutral! (was RE: Unicode Public Review Issues update)
I (Marco Cimarosti) wrote: Jony Rosenne wrote: I don't remember whether Hebrew Braille is written RTL or LTR. Braille is always LTR, even for Hebrew and Arabic. Hwæt! I noticed only now that the Bidirectional Category of braille characters is ON - Other neutrals! AFAIK, that is completely broken! A run of braille text *must* remain LTR in a bidi context, because braille is never RTL. This ON bidi category makes it impossible to correctly encode, e.g., a manual of Braille in Hebrew or Arabic, because the braille runs would get swapped by the Bidirectional Algorithm. _ Marco
RE: What things are called (was Non-ascii string processing)
Jill Ramonsky wrote: Hey - the public will just have to get used to it! No, the public should not be bothered with these technical details: in the user manual, a book will still be a book. The fact that, in the source code of the application, book means something else is of interest only to programmers. _ Marco
RE: Non-ascii string processing?
Peter Kirk wrote:

For i% = 1 To Len(utf8string$)
    c$ = Mid(utf8string$, i%, 1)
    Process c$
Next i%

Such a loop would be more efficient in UTF-32 of course, but this is still a real need for working with character counts. If the string type and functions of this Basic dialect are not Unicode-aware, then: - Len(s$) returns the number of *bytes* in the string; - Mid(s$, i%, 1) returns a single *byte*; - your Process() subroutine won't work... If the string type and functions are Unicode-aware (as, e.g., in Visual Basic or VBScript), then I'd expect that the actual internal representation is hidden from the programmer, hence it makes no sense to talk about a UTF-8 string. _ Marco
RE: Non-ascii string processing?
Elliotte Rusty Harold wrote: A W3C XML Schema Language validator needs a character based API to correctly implement the minLength and maxLength facets on xsd:string As far as I understand, xsd:string is a list of Character-s, and a Character is an integer which can hold any valid Unicode code point. In other terms, xsd:string is necessarily in UTF-32 (or something close to it): it cannot be in UTF-8 or UTF-16. The numbers returned by length, minLength and maxLength are the actual, minimum and maximum number of *list elements* contained in the list. I.e., in the case of xsd:string, the *size* of the string in *encoding units*. The fact that, in UTF-32, the *size* of the string in encoding units corresponds to the number of characters is coincidental. In any case, the useful information is always the *size* of the string in encoding units (octets for UTF-8, 16-bit units for UTF-16, etc.), not the number of characters it contains. _ Marco
RE: Non-ascii string processing?
Doug Ewell wrote: Depends on what processing you are talking about. Just to cite the most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented strlen() will fail dramatically. Why? The purpose of strlen() is counting the number of *bytes* needed to store a certain string, and this works just as well for UTF-8 as it does for SBCS's or DBCS's. What strlen() cannot do is counting the number of *characters* in a string. But who cares? I can imagine very few situations where such information would be useful. _ Marco
RE: Non-ascii string processing?
Theodore H. Smith wrote: Hi lists, Hi, member. I'm wondering how people tend to do their non-ascii string processing. I think no one has been doing ASCII string processing for decades. :-) But I guess you meant non-SBCS (single byte character set) string processing. [...] So, I'm wondering, in fact, is there ANY code that needs explicit UTF8 processing? Here's a few I've thought of. In general, you need UTF-32 whenever you need to access the single *characters* in a string. This is needed for all kinds of lexical or typographic processing, e.g.: - case matching or conversion (a vs. A); - loose matching (á vs. a); - displaying the text; Can anyone tell me any more? Please feel free to go into great detail in your answers. The more detail the better. There is at least one case in which you need UTF-8-aware code even if not accessing single characters: it is when you *trim* a string at an arbitrary byte position. E.g.: char str1 [9] = "abc"; char * str2 = "αβγ"; strncat(str1, str2, sizeof(str1)); If strncat() is UTF-8 aware, str1 will be "abcαβ" + null terminator (8 bytes). But if strncat() is *not* UTF-8 aware, str1 will contain an invalid UTF-8 string: "abcαβ" + an *illegal* byte (0xCE) + null terminator. _ Marco
RE: Non-ascii string processing?
Stephane Bortzmeyer wrote: On Mon, Oct 06, 2003 at 12:09:34PM +0200, Marco Cimarosti [EMAIL PROTECTED] wrote a message of 14 lines which said: What strlen() cannot do is counting the number of *characters* in a string. But who cares? I can imagine very few situations where such information would be useful. It is one thing to explain that strlen() has byte semantics and not character semantics. It is another to assume that character semantics are useless. I never said that character semantics are useless: I said that it is almost always useless to count the *number* of Unicode characters in a string. One of the few cases in which such a count could be useful is to pre-allocate a buffer for an UTF-8 to UTF-32 conversion. But there is no need of a general-purpose API function for such a special need. Most text-processing software allows you to count the number of characters in a document, for instance. Yes. And: 1) That is a very special need of a very special kind of application (a word processor), so it doesn't justify a general-purpose API function: people don't normally write word processors every day. 2) That count cannot be done by counting Unicode characters (i.e., encoding units): you have to count the objects that the user perceives as typographical characters. E.g., control or formatting characters should be ignored, sequences of two or more space characters should be counted as one, and a word like élite is always counted as five characters, regardless of the fact that it might be encoded as six Unicode characters. In an Indic or Korean text, each syllable counts as a single character, although it may be encoded as a long sequence of Unicode characters. 3) That is a very silly count anyway. If you want to have an idea of the size of a document, lines or words are much more useful units. Any decent Unicode programming environment should give you two functions, one for byte semantics and one for character semantics. Both are useful. OK. 
But the length in characters of a string is not character semantics: it's plain nonsense, IMHO. _ Marco
RE: Non-ascii string processing?
Stephane Bortzmeyer wrote: OK. But the length in characters of a string is not character semantics: it's plain nonsense, IMHO. I disagree. Feel free. But I still don't see any use in knowing how many characters are in an UTF-8 string, apart from the use that I already mentioned: allocating a buffer for a UTF-8 to UTF-32 conversion. _ Marco
RE: Non-ascii string processing?
Edward H. Trager wrote: But I still don't see any use in knowing how many characters are in an UTF-8 string, apart from the use that I already mentioned: allocating a buffer for a UTF-8 to UTF-32 conversion. Well, I know a good use for it: a console or terminal-based application which displays information using fixed-width fonts in a tabular form, such as a subset of records from a database table. To calculate how wide to display each column, knowing the maximum number of characters in the strings for each column is a useful starting place. Well, I am just about to start a time-consuming task: fixing an application which was based on the assumption that the number of characters in a string was a good starting place to format tabular text in a fixed-width font... You have already explained why this can't work when CJK or other scripts pop in. What you really need for such a thing is a function which computes the width of a string in terms of display units, rather than its length in terms of characters. _ Marco
RE: FW: Web Form: Other Question: British pound sign - U+00A3
This (Peter's) answer is, in my understanding, the nearest to the truth. He made the same assumption I did: you declared that your file was UTF-8 but actually it wasn't. :-) Here is the problem: How do I make my keyboard which only produces 8-bit [...] The keyboard has nothing to do with it. The problem is how you save the file. You should see if the Save as... command of your text editor (or HTML authoring tool) has an option like Save as UTF-8. If it doesn't, see if your Notepad utility has it (I don't remember in which version of Windows it was added). If it's there, you just open your file and save it selecting Save as UTF-8. You can also use a utility to convert the character set. E.g., try the iconv.exe utility from libiconv (http://sourceforge.net/project/showfiles.php?group_id=23617). Ciao. Marco
RE: Web Form: Other Question: British pound sign - U+00A3
[EMAIL PROTECTED] wrote (through Magda Danish): [...] Our problem is the representation of the £ sign (British pound sign - U+00A3). When we type this character into our pages and then set the character encoding in our pages to Unicode (UTF-8) (either by setting it directly in the HTTP header, or setting it using the <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> I think it should be charset=UTF-8, in capital letters. I was looking into the IANA charsets today, and I don't remember having seen a lowercase alias for that. tag), when we view the pages we see the standard ASCII set of characters, but the Pound sign displays as an error. The most obvious question is: are your pages *actually* in UTF-8? It is not enough that you *declare* that they are UTF-8 if you didn't actually save them as UTF-8 with your editor. Could you put on line a small test page containing the pound symbol and post the URL? Also which version of Unicode does HTML 4.0 support using escape characters (e.g. &#163;)? It doesn't matter which version of Unicode it is, because the pound symbol has been in Unicode from day zero. Notice however that an HTML character reference must end with a semicolon: &#163; _ Marco
RE: Internal Representation of Unicode
[EMAIL PROTECTED] wrote: In a plain text environment, there is often a need to encode more than just the plain character. A console, or terminal emulator, is such an environment. Therefore I propose the following as a technical report for internal encoding of unicode characters; with one goal in mind: character equalence is binary equalence. I guess you meant equivalence. Q1: But what are character equivalence and binary equivalence, and why did you choose them as your goals? I thought of dividing the 64 bit code space into 32 variably wide plains, Q2: What are these plains for? Why are there 32 of them? one for control characters, one for latin characters, one for han characters, Q3: Why do you want to treat Latin characters and Han characters differently? There is nothing special with Latin or Han characters in Unicode: they are just 2 of the about 50 scripts currently supported in Unicode. (see http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://www.unicode.org/Public/UNIDATA/Scripts.txt) Q4: And how do you plan to distinguish them? Both Latin and Han characters are scattered all over the Unicode space, so you need to check many ranges to determine which character belongs to which category. Q5: And what about all characters which are neither Latin nor Han? and so on; using 5 bits and the next 3 fixed to zero (for future expansion and alignment to an octet). I call plain 0 control characters and won't discuss it further. Q6: Why do control characters have a special handling? Q7: Don't control characters have properties attached like any other characters? One example of a property which could be useful to attach to control characters is directionality. E.g., a TAB is always a TAB but, after it has passed through the Bidirectional Algorithm, its directionality can be resolved to be either LTR or RTL. 
Plain 1, I had intended for latin characters with the following encoding method in mind:

bits  63..59   58..56   55..40   39..32   31..24   23..16   15..8   7..0
     +-------+--------+--------+--------+--------+--------+-------+------+
     | plain |  zero  |  attr  |  res   |  uacc  |  lacc  |  res  | char |
     +-------+--------+--------+--------+--------+--------+-------+------+

* Plain  Plain (5 bits)
* Zero   Zero bits (3 bits)
* Attr   Attributes (16 bits)

Q8: What kind of information are these three fields for? Q9: In case your answer to Q8 is they are application-defined, then what is the rationale for defining and naming more than one field? I mean: if they are application-defined, why not leave the task of defining sub-fields to the application?

* Res    Reserved (8 bits)
* Uacc   Upper Accent (8 bits)
* Lacc   Lower Accent (8 bits)

Q10: Why do you treat accents specially? They are just characters like any others. In Unicode there is no special limitation as to how many accents can be applied to a base character. There is also no obligation for accents to have a base character.

* Res    Reserved (8 bits)
* Char   Character (8 bits)

Q11: How can you store a Latin character in 8 bits? Unicode has 938 Latin characters, and their codes range from U+0041 to U+FF5A. All of these fields are actually implementation defined, with just one rule for char: don't include characters that can be made with combinations, that's what the accent fields are for. But characters are not necessarily decomposed into one Latin character with one upper accent and one lower accent. E.g., U+01D5 (LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON) decomposes to U+0055 U+0308 U+0304 (LATIN CAPITAL LETTER U, COMBINING DIAERESIS, COMBINING MACRON). Both COMBINING DIAERESIS and COMBINING MACRON are upper accents. Q12: How are you going to deal with a combination of, e.g., a base letter + 5 upper accents + 3 lower accents? This allows for 255 upper and lower accents which should be enough -- for now. I counted 129 upper accents. But their codes range from U+0300 to U+1D1AD. Q13: How are you going to compress these codes into 8 bits? 
Are you planning to use a conversion table from the Unicode code to your internal 8-bit code? For Han characters I thought of the following encoding method (with no particular plain in mind):

bits  63..59   58..56   55..40   39..32   31..0
     +-------+--------+--------+--------+------+
     | plain |  zero  |  attr  | style  | char |
     +-------+--------+--------+--------+------+

* Plain  Plain (5 bits)
* Zero   Zero bits (3 bits)
* Attr   Attributes (16 bits)
* Style  Stylistic Variation (8 bits)

Q14: What kind of information is in field Style? Q15: Why do only Han characters have this? Letters in many other scripts may have stylistic variations. E.g.,
RE: About that alphabetician...
Michael Everson wrote: At 08:33 -0700 2003-09-25, John Hudson wrote: Unicode is an encoding standard for text on computers that allows documents in any script and language to be entered, stored, edited and exchanged. blank stare from layman Unicode is a code in which every letter of every alphabet in the world corresponds to a number. This numeric code is used to write text inside computers, because only numbers can be written inside computers. When the computer shows on the screen the text which it has inside, it draws the letters corresponding to the Unicode numbers which it has inside. My 4-year-old listened to this explanation and said everything was clear. The only problem is that he now wants to disassemble my computer to see the numbers it has inside. He thinks that the numbers are stored in the form of talking ladybugs which say the number out loud when you tap on them (he got this idea from one of his favorite books: Learn the Numbers with the Talking Ladybugs). _ Marco
RE: [OT?] QBCS
Doug Ewell wrote: [...] (BTW, pet peeve: The word acronym should only be used to mean a pronounceable WORD (nym) formed from the initials of other words. Classic examples are scuba and radar. If you can figure out how to pronounce qbcs, more power to you, but to me it's just an abbreviation.) Right, sorry. (I can pronounce ['kubks], although I wouldn't do it in front of my managers and customers. :-) Actually, I don't like this QBCS term and I'd rather avoid saying it myself. But I wanted to be sure what other people mean when they use it. [...] So what it really means must be quadra-byte character encoding, and both GB 18030 and UTF-32 should fit into that category. GB 18030, yes, because its code units vary from one to four bytes in length. UTF-32, no, because its code units are uniformly 32 bits. But UTF-8 fits the definition. _ Marco
[OT?] QBCS
It seems that the IT world has a new acronym: QBCS. I understand that it stands for quadra-byte character set, and I heard it used to refer to GB 18030. My question is: is it just a fancy synonym for GB 18030 or can it also refer to Unicode or other encodings? Thanks in advance. _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
I posted my feedback through the report forms. The text of the two posts is attached. (I considerably shortened the list of non-Latin punctuation marks that I suggest excluding from identifiers, although I added two of the Hebrew punctuation marks suggested by Peter Kirk.) _ Marco

Feedback on UTR#31 (draft 1): Full/Half-Width Characters. I suggest that all compatibility characters which are labelled wide, narrow or small and whose compatibility decomposition is already in class Pattern_Syntax be added to class Pattern_Syntax as well. In practice, I am suggesting to add the following lines to section 4.1 Proposed Pattern Properties:

FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE

Rationale. These characters are almost identical, visually and semantically, to their normal-width counterparts. Allowing such characters in identifiers means allowing identifiers which look identical to expressions of a totally different kind. 
E.g., an identifier such as foo，bar (where ， is U+FF0C FULLWIDTH COMMA) would look identical to the expression foo, bar (identifier foo + comma + space + identifier bar). Regards. Marco Cimarosti ([EMAIL PROTECTED])

Feedback on UTR#31 (draft 1): Non-Latin Punctuation. I suggest that a small set of non-Latin punctuation marks be added to class Pattern_Syntax. Each one of the punctuation marks that I am suggesting to include complies with the following conditions: 1) It is very similar in shape to an ASCII-range character which is already in class Pattern_Syntax; 2) It is very similar in function to an ASCII-range character which is already in class Pattern_Syntax; 3) It is used in the modern orthography of modern languages and/or it is commonly available on national keyboards; 4) It is not commonly used to form words or phrases which may be used as identifiers. In practice, I am suggesting to add the following lines to section 4.1 Proposed Pattern Properties:

037E ; Pattern_Syntax # GREEK QUESTION MARK
0387 ; Pattern_Syntax # GREEK ANO TELEIA
055C..055E ; Pattern_Syntax # ARMENIAN EXCLAMATION MARK..ARMENIAN QUESTION MARK
0589 ; Pattern_Syntax # ARMENIAN FULL STOP
05C0 ; Pattern_Syntax # HEBREW PUNCTUATION PASEQ
05C3 ; Pattern_Syntax # HEBREW PUNCTUATION SOF PASUQ
060C..060D ; Pattern_Syntax # ARABIC COMMA..ARABIC DATE SEPARATOR
061B ; Pattern_Syntax # ARABIC SEMICOLON
061F ; Pattern_Syntax # ARABIC QUESTION MARK
066A..066C ; Pattern_Syntax # ARABIC PERCENT SIGN..ARABIC THOUSANDS SEPARATOR
06D4 ; Pattern_Syntax # ARABIC FULL STOP
066D ; Pattern_Syntax # ARABIC FIVE POINTED STAR
0964..0965 ; Pattern_Syntax # DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
10FB ; Pattern_Syntax # GEORGIAN PARAGRAPH SEPARATOR
1362..1368 ; Pattern_Syntax # ETHIOPIC FULL STOP..ETHIOPIC PARAGRAPH SEPARATOR

Rationale. Punctuation marks complying with conditions #1 to #3 may easily be confused with ASCII-range characters which are normally used in the syntax of computer languages and notations. 
Allowing such characters in identifiers would mean allowing identifiers which look almost identical to expressions of a totally different kind. E.g., an identifier such as return; (where ; is U+037E GREEK QUESTION MARK) looks identical to the expression return; (identifier or keyword return + semicolon). However, punctuation marks mentioned in condition #4 (e.g. syllable separators, morpheme separators, abbreviation marks, diacritic marks, apostrophes) are excluded from my suggestion (i.e. I suggest allowing them in identifiers) because they are useful to form words or phrases which may act as identifiers. Character-by-character rationale. In the following list, I listed each suggested character along with the ASCII-range character which looks similar to it (as per condition #1 above) and with the ASCII-range character
[OT?] ICU training offerings anyone?
Dear Unicoders, Does any company offer training on ICU programming? I am more interested in courses located in Europe, but I'd also be glad to know about courses in North America or elsewhere. If you feel that this information is not appropriate for the public list, please feel free to reply privately. Thank you in advance. Regards. _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Peter Kirk wrote: Similarly, Hebrew geresh and gershayim look like quotation marks and are used interchangeably in legacy encodings, the same with maqaf and hyphen - maqaf is very much the cultural equivalent of hyphen, and I have seen recent discussion about whether the hyphen key on a Hebrew keyboard ought actually to generate a maqaf. No, wait. The fact that maqaf is the cultural (and visual) equivalent of a hyphen is a good reason to *exclude* it from class Pattern_Syntax, i.e. *allow* it in identifiers, so that composite words can be used as identifiers. As an ordinary Latin hyphen is already in the list, by your argument there is no reason to exclude other things that look like it and function like it. I guess that the only reason why the ASCII '-' is included in Pattern_Syntax is that it is also used as minus. If it only had the meaning hyphen, it would not be in Pattern_Syntax. _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Peter Kirk wrote: Well, the situation with Hebrew sof pasuq is almost identical to that for Greek and Arabic question marks, except that it is functionally a full stop not a question mark, so I can't see any reason other than prejudice for omitting it from the list. Well, I had a much better reason than prejudice: ignorance. :-) That's why I said that my two lists were tentative and asked for comments. [...] - maqaf is very much the cultural equivalent of hyphen, and I have seen recent discussion about whether the hyphen key on a Hebrew keyboard ought actually to generate a maqaf. [...] I'm not talking about biblical Hebrew here, I'm talking about a living modern language. [...] And these are exactly the kinds of reasons that I was looking for: punctuation marks used in modern text and/or available on keyboards are good candidates for inclusion in Pattern_Syntax (i.e., exclusion from usage in identifiers). [...] extend the list to include all punctuation and to allow as yet undefined characters to be added to it. Well, the requirement for an invariable set seems to be part of the rules of the game with this UTR, so I'll stick to it. I guess that this requirement is due to backward compatibility issues. If version X of a certain programming language accepts identifier foo:bar (where : is a certain mark), it is not acceptable that version X+1 of the same language treats the same sequence as a syntax error: that would make existing source code in that language potentially unusable. Obviously, it must be possible to customize the default sets: most *existing* computer languages already allow in identifiers characters that would be excluded by UTR#31 (e.g.: _ - $ '). _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Peter Kirk wrote: But the other way round is less of a problem. So I am suggesting that for now we define all punctuation characters except for those with specifically defined operator functions, also all undefined characters, as giving a syntax error. This makes it possible to define additional punctuation characters, even those in so far undefined scripts like Tifinagh, as valid operators in future versions. Yes, but this makes it impossible to use any as-yet undefined scripts in identifiers! E.g., you'd never be able to write a variable name in Tifinagh letters in future versions! Unless you are still thinking of non-fixed sets, in which case I must remind you again that there are no balls or door-keepers in a card game... :-) Ciao. Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Rick McGowan wrote: the process as possible so that it can be considered The draft is found at http://www.unicode.org/reports/tr31/ and feedback can be submitted as described there. (Before submitting official feedback, I'd like to discuss my comments here. BTW, which Type of Message should I use in the feedback form? Is it OK to use Technical Report or Tech Note issues?) My two cents are both about adding characters to the Pattern_Syntax of 4.1 Proposed Pattern Properties. IMHO: 1. Full-width, half-width, and small punctuation characters should be in class Pattern_Syntax, like their normal-width counterparts. 2. Non-Latin punctuation characters should be in class Pattern_Syntax, like their Latin counterparts. The rationale for suggestion 1 is that wide, narrow and small compatibility characters are substantially identical (in appearance and function) to their normal-width counterparts. A parser allowing an unquoted full-width punctuation character in an identifier is guaranteed to cause confusion to the user. E.g., consider the following expression: foo，bar To me, it *definitely* looks like two identifiers separated by a comma, and I expect my parser to agree with me on this, even if the comma is actually a full-width comma. I am not saying that the parser must necessarily accept a full-width comma in that position: it is perfectly OK if the above expression causes a syntax error such as: Illegal character U+FF0C (FULLWIDTH COMMA) after identifier 'foo'. But what the parser should absolutely *not* do, IMHO, is handle foo，bar as a *single* identifier! Doing such a thing is guaranteed to cause troubles to me. E.g., I might receive a puzzling error message saying: Parameter missing: this statement requires 2 parameters, while I can *see* that there *are* two parameters: foo and bar... The rationale for suggestion 2 is very similar. 
E.g., the following expression looks like a perfectly legal C++ or Java statement: return; If the compiler tells me: Undeclared identifier, I may go crazy for the whole day trying to figure out what's going on... But if it tells me Illegal character U+037E (GREEK QUESTION MARK) after keyword return, then I immediately understand that something is wrong with that semicolon. The reason I keep suggestions 1 and 2 separate is that, in the case of wide, narrow and small compatibility characters, it is trivial to determine the corresponding regular character, while in the case of non-Latin punctuation there is room for discussing which punctuation characters are similar enough (in function or appearance) to which Latin punctuation character. For full-width, half-width, and small punctuation characters, my suggestion is to add the following lines to 4.1 Proposed Pattern Properties:

FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
FF64 ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE

For non-Latin punctuation characters, this is my tentative list of characters that may cause trouble if used in identifiers, and which, consequently, should be 
added to class Pattern_Syntax:

037E GREEK QUESTION MARK
0387 GREEK ANO TELEIA
055C ARMENIAN EXCLAMATION MARK
055D ARMENIAN COMMA
055E ARMENIAN QUESTION MARK
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
060D ARABIC DATE SEPARATOR
061B ARABIC SEMICOLON
061F ARABIC QUESTION MARK
066A ARABIC PERCENT SIGN
066B ARABIC DECIMAL SEPARATOR
066C ARABIC THOUSANDS SEPARATOR
06D4 ARABIC FULL STOP
0964 DEVANAGARI DANDA
0965 DEVANAGARI DOUBLE DANDA
10FB GEORGIAN PARAGRAPH SEPARATOR
1362 ETHIOPIC FULL STOP
1363 ETHIOPIC COMMA
1364 ETHIOPIC SEMICOLON
1365 ETHIOPIC COLON
1366 ETHIOPIC PREFACE COLON
1367 ETHIOPIC QUESTION MARK
1368 ETHIOPIC PARAGRAPH SEPARATOR
166E CANADIAN SYLLABICS FULL STOP
1802 MONGOLIAN COMMA
1803 MONGOLIAN FULL STOP
1804 MONGOLIAN COLON
1808
RE: Proposed Draft UTR #31 - Syntax Characters
Jill Ramonsky wrote: Damn. I guess you guys are all going to hate me for asking this, but ... what exactly is a mathematical space? A compatibility space character used only in typesetting mathematics: 205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;; PS. I'm going to pre-empt any puns on other interpretations of the phrase mathematical space, [...] That's unfair! I had such a nice story about a mathematician who goes into a space bar and asks for 568.31 ml of... :-) _ Marco
RE: Proposed Draft UTR #31 - Syntax Characters
Mark Davis wrote: Technical Report issues would be fine. I think #1 is worth considering. For #2, see other message to Peter Kirk. I agree with your statement: The purpose of the Pattern Syntax characters is *not* to list everything that is a symbol or punctuation mark. But that is what Peter Kirk suggested, not what I proposed. I proposed to exclude a *limited* set of script-specific punctuation that *might* be confused with punctuation characters normally used in the syntax of computer languages, either because they look identical, or because they are perceived culturally as another form of the same character. E.g., I kept out of the list everything belonging to ancient scripts (who's going to write programs in Linear B!?) and anything that I suspected would be valid inside a word or expression: hyphens, emphasis markers, ellipsis marks, etc. You said that the list of ranges must be invariable, but I doubt that we will see many new *modern* and *commonly* used punctuation marks in future versions, so I think that this requirement for invariability can reasonably be maintained. I already made the example of the Greek question mark which may be mistaken for a semicolon. That is *not* an unlikely situation: if a Greek programmer has his keyboard in Greek mode (because he just finished typing an identifier containing Greek letters) he may well forget to turn it to Latin mode before typing the trailing semicolon. Similarly, due to the fact that some punctuation characters (parentheses, etc.) are mirrored in a RTL context, an Arab programmer may think that ؟ is just an alternate RTL glyph for ?, so he may be puzzled by apparently absurd error messages. E.g., he types: foo?bar And the system calls routine foo, passing variable bar to it. (? is the call operator of this hypothetical programming language.) So, he switches to Arabic mode and types the same expression in Arabic letters. But the system says: Undeclared identifier. 
But he is *sure* that he did declare that routine and that variable, so what's going on? If the system said instead: Character '؟' is not a legal operator, everything would be much clearer. _ Marco
RE: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)
Peter Kirk wrote: [...] I guess English legs tended to be longer than Roman ones. Well, if by English you mean those Germanic barbarians who invaded Britannia, I guess that the British mile existed way before they set foot on the island... _ Marco
RE: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)
Doug Ewell wrote: Shouldn't a pint of beer be administratively fixed at 500 mL, just as a fifth of liquor in America is now officially 750 mL? Seems like a good task for an ISO working group. You could generalize it a bit: Alignment Of Metric And Imperial Units Whose Difference Is So Small As To Be Pointless. E.g., I never understood why on earth metres and yards should be kept different. In a public park somewhere in UK or Ireland I have seen the following sign: TOILETS --- 50 yds (45.72 m) It must be a really urgent need if one cares about those 3.28 metres... _ Marco
RE: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)
Pim Blokland wrote: It must be a really urgent need if one cares about those 3.28 metres... 4.28 actually. Ooops. But are you serious about lengthening the yard to be the same size as the meter? I was just joking... Ha! Fat chance! You might as well suggest we abolish the yard altogether! ... What I really meant was this, in fact. Everybody understands that 50 yards on a sign for a toilet, or 1000 yards on a sign for a filling station, are just rough approximations (especially since they cannot know in advance which closet or hose I will use). They just mean "The toilet is quite close; resist!" and "Start slowing down and prepare to turn." It would be just as fine to write 50 m or 1000 m. Of course, this is if you *want* to abolish the imperial system and adopt the metre; but this *is* what the UK and Ireland have decided to do. _ Marco
RE: Handwritten EURO sign (off topic?)
Anto'nio Martins-Tuva'lkin wrote: On 2003.08.06, 11:12, Philippe Verdy [EMAIL PROTECTED] wrote: the placement of the currency unit symbol or multiple is language dependent, and the same local practices are used with the euro as were used for pre-euro currencies. You mean that Dutch should write one euro as 1,- EUR, while Portuguese as 1EUR00, and perhaps British as EUR 1.00?... It may be the case, but I'd find that a bad idea and worth fighting against. Why? Different countries have always used different characters as decimal or grouping separators for numbers. The Italian for one and a half euros is uno virgola cinquanta euro (where virgola means comma). Should we say comma and write a dot!? After all the euro is a common currency and its figures should be written in a common way. Why? In fact, the position of the currency unit and decimal separator, or the placement of the negative sign, depends mostly on the current locale (language/region) and not on the indicated currency, so this convention is applied locally for *all* currency units. Nope, this is not true. In most cases, it is: amounts in foreign currency are normally formatted according to local conventions. E.g. a price in US$ in an Italian magazine would probably be formatted as $2.345,50, not $2,345.50 or 2,345$50¢. Using the cent sign is mostly US-specific and the symbol is not recognized as such in most European countries, so the cent sign is bound directly to the dollar. [...] then I suppose there is a theoretical possibility that it may be used as a symbol for the euro cent (though I personally prefer cEUR). The problem is not *which* symbol to use for the cent: it is the very concept of a symbol for cents that is not familiar in most EU countries. I guess that Ireland is the only euro-zone country where you can see a price expressed in cents, such as 55 cents. In most other countries of Europe, the same amount would be expressed as 0.55 euros. _ Marco
RE: Arabic script web site hosting solution for all platforms
Philippe Verdy wrote: Excessive cross-posting to multiple newsgroups, forums and list servers is considered bulk (and also opposed to netiquette). As this message targets too large an audience, is off topic, and is also a commercial ad, I can say that bulk + unsolicited makes it fully qualifiable as SPAM. IMHO, Lateef Sagar's message was perfectly legitimate and absolutely ON TOPIC for the Unicode List. He didn't try to sell anything but simply pointed us to a demo page of his (questionable) technology. Over the years, I have seen lots of people come here to announce new releases of their Unicode-related products, and no one called them spammers. This is particularly true because this message calls for no other response than visiting a third-party web site. This is not discussion, but a proposal of service. Although no comments or criticism were explicitly requested, the context (the Unicode List and the other forums related to encoding or fonts) and the demonstrative nature of the page pointed to may be considered an implicit request for technical comments and discussion. I will complain to Yahoo for your abuse of its AUP in this webmail. Sorry, but I have to complain to Sarasvati about your threatening manners. _ Marco
RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
Rob Mount wrote: Q1: Can a character be both alphabetic and diacritic? I would say yes. My understanding of the Lm general category is: a diacritic letter. Q2: Is there a definitive answer as to whether this is an alphabetic character? Strictly speaking, as katakana and hiragana are not alphabets, their letters cannot be called alphabetic. But I guess that you mean alphabetic in the sense that isalpha() should return TRUE for it, i.e. in the sense that it is a character used to write *words* in the orthography of some language. In this sense, yes, IMHO: isalpha() should return true for all the characters having a general category of L (Lu, Ll, Lt, Lm, Lo). _ Marco
RE: Tamazight/berber language : How to send mail, write word documents ....
Chris Jacobs wrote: Depends on how much text you need. If it is just a few words then getting UniPad from http://www.unipad.org/ would be enough. You can copy and paste the chars from it. If this is not enough then have a look at http://www.tavultesoft.com/keyman/ BTW, UniPad also has a built-in keyboard editor. But, of course, that would only be usable within UniPad. _ Marco
RE: Tamazight/berber language : How to send mail, write word documents ....
Philippe Verdy wrote: However the interesting part of your question for discussion on this list is: - Which Unicode character should be used to encode the spacing ring? (it may conflict with the degree sign, or a superscript small letter O) - Should you use a Greek gamma or a Latin gamma, and a Greek epsilon? Of course, one should avoid mixing up different scripts in the same orthography. So, I'd suggest: - abbud (epsilon): U+0190 (LATIN CAPITAL LETTER OPEN E), U+025B (LATIN SMALL LETTER OPEN E) - arum (gamma): U+0194 (LATIN CAPITAL LETTER GAMMA), U+0263 (LATIN SMALL LETTER GAMMA) Your table also displays only the lowercase letters. It would be interesting to show the associated uppercase letters. The only dubious case in the table is uppercase arum, which could have the Greek shape. But it's unlikely. _ Marco
PUA usage (was RE: Announcement: New Unicode Savvy Logo)
[OOOPS! This works better if I set the proper MIME encoding... Sorry] Philippe Verdy wrote: This contrasts a lot with the Unicode codepoints assigned to abstract characters, that are processable out of any contextual stylesheet, font or markup system, where its only semantic is in that case private use with no linguistic semantic and no abstract character evidence, and all with the same default character properties (including shamely the bidi properties needed to render and layout the fonted text, In HTML, the default directionality of characters can be overridden with the BDO tag. E.g.: <BDO dir="rtl">hi!</BDO> This should display as an RTL string, with ! on the left side and h on the right side. The same can be achieved also in plain-text Unicode, using RLO, LRO and PDF: hi! (U+202E U+0068 U+0069 U+0021 U+202C) Ciao. Marco
RE: The role of country codes/Not snazzy
Brian Doyle wrote: on 5/29/03 9:15 AM, Marion Gunn at [EMAIL PROTECTED] wrote: When a reference to using embryonic ISO 639-3 to 'legitimize' SIL's flawed Ethnologue is let pass with no comment Why is Ethnologue flawed? And how is this more on-topic on a mailing list called Unicode List than discussing a banner offered by the _Unicode_ Consortium to flag web pages encoded in _Unicode_? _ Marco
RE: Not snazzy (was: New Unicode Savvy Logo)
Philippe Verdy wrote: Savvy is better understood in this context as aware, than archaic or informal in your English-Italian dictionary. No, archaic, American and informal are usage labels, not translations. The translation is buon senso (common sense). (BTW, it is: Dizionario Garzanti di inglese, Garzanti Editore, 1997, ISBN 88-11-10212-X) _ Marco
RE: Not snazzy (was: New Unicode Savvy Logo)
Rick McGowan wrote: 2. It is unlikely that the Unicode *logo* itself (i.e. the thing at http://www.unicode.org/webscripts/logo60s2.gif) will be incorporated directly in any image that people are allowed to put on their websites, because putting the Unicode logo on a product or whatever requires a license agreement. I.e. the submissions from E. Trager are out of scope because they contain the Unicode logo on the left side. As this comes from a Unicode official, I guess we should simply accept it... Nevertheless, I wonder whether displaying the Unicode *logo* per se has the same legal implications as displaying a *banner* which contains the Unicode logo. IMVHO, that seems like the difference between producing a T-shirt with the Unicode logo and wearing it. In the first case, I must demonstrate that I asked for and obtained permission from the trademark owner; in the second case, I don't have to demonstrate anything (apart, maybe, from the fact that I did not steal that piece of garment). _ Marco
RE: Not snazzy (was: New Unicode Savvy Logo)
Andrew C. West wrote: I agree with Philippe on this one. A sensible, and easily understandable, motto like The world speaks Unicode would be much better. The word savvy just sends a shiver of embarrassment down my spine. Not only is savvy not a word that is probably high in the vocabulary list of non-English speakers, but I don't think many native English speakers would ever use it by choice (maybe it's just me, but I really loathe the word). Yes, you are right. I had never heard the word savvy before this morning. My English-Italian dictionary has two savvy entries: an adjective (labeled fam. amer. = US English, informal) and a noun (labeled antiq. / fam. = archaic or informal). However, all the translations have to do with common sense, and none of them seems to explain the intuitive meaning of Unicode savvy, which I guess is supposed to be: Unicode enabled, Unicode supported, encoded in Unicode, etc. Another i18n problem is the lettering: the unusual ligature of the first three letters and the mix-up of upper- and lower-case forms can make the text completely unintelligible to people not familiar with handwritten forms of the Latin alphabet. I guess that many people would wonder in what strange alphabet Unicode is written. As for the V-shaped tick in the square, it is so deformed and stylized that it might be hard to recognize. Keep in mind that this symbol is quite English-specific; in many parts of the world, different signs are used to tick squares on paper forms (e.g., X, O, a filled square, etc.). The English-style tick is only seen in GUI interfaces like Windows, Mac, etc. I also share the concerns about colors: besides their ugliness (I would never have imagined that that curious yellow could be called pink), they fail to recall the red and white of the well-known Unicode logo. If I didn't know it before seeing them, I would never have associated those icons with the Unicode standard or the Unicode Consortium.

My humble suggestions would be: 1) Replace the semi-dialectal Unicode savvy with a clearer motto (such as encoded in Unicode, or the other phrases suggested by others); possibly, check that all the words used are in the high-frequency part of the English lexicon. 2) Use the regular squared Unicode logo which is seen in the top-left corner of the Unicode web site. That's already famous and immediately hints at Unicode. 3) Compose the motto (*including* the word Unicode) in a widespread and highly readable typeface, in black or in one of the colors of the Unicode logo. 4) Make the V tick sign as similar as possible to a square root symbol, because that is the glyph which has been popularized by GUI interfaces. Ciao. Marco
RE: Exciting new software release!
Doug Ewell wrote: Drop everything and check out a kewl new Windows program available at: http://users.adelphia.net/~dewell/mathtext.html 𝔬𝔱𝔣𝔩! _ Marco
RE: Several BOMs in the same file
Stefan Persson wrote: Let's say that I have two files, namely file1 and file2, in any Unicode encoding, both starting with a BOM, and I combine them into one by using cat file1 file2 > file3 in Unix or copy file1 + file2 file3 in MS-DOS; file3 will have the following contents: BOM, contents from file1, BOM, contents from file2. Is this in accordance with the Unicode standard, or do I have to remove the second BOM? IMHO, Unicode should not specify such a behavior. Deciding what a shell command is supposed to do is a decision of the operating system, not of text encoding standards. BTW, consider that both Unix cat and DOS copy are not limited to Unicode text files. Actually, they are not even limited to text files at all: you could use them to concatenate a bitmap with a font with an HTML document with a spreadsheet... whether the result makes sense or not is up to you and/or to the applications that will process the resulting file. Probably, there should be two separate commands (or different options of the same command): one to do a raw byte-by-byte concatenation, and one to do an encoding-aware concatenation of text files. E.g., imagine a cat command with these extensions: Synopsis: cat [ -... ] [ -R encoding ] { [ -F encoding ] file } Description: ... If neither -R nor -F is specified, the concatenation is done byte by byte. Options: ... -R specifies the encoding of the resulting *text* file; -F specifies the encoding of the following *text* file. Your command above would now expand to something like this: cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3 Provided with information about the input encodings and the expected output encoding, cat could now correctly handle BOMs, endianness, new-line conventions, and even perform character set conversions. Without this extra info, cat would retain its good ol' byte-by-byte functionality.
Similar options could be added to any Unix command potentially dealing with text files (cp, head, tail, etc.), as well as to their equivalents in DOS or other operating systems. _ Marco
RE: Several BOMs in the same file
I (Marco Cimarosti) wrote: As a minimum, option -v must know the semantics of the NL and LF control codes, of the digits, and of white space. Sorry, I meant: option -n. _ Marco
RE: Several BOMs in the same file
Kent Karlsson wrote: I'm not going into the implementation part; just pointing out that this issue is not something an operating system can ignore. cat and cp can and shall ignore it. They are octet-level file operations, attaching no semantics to the octets. Try iconv. This byte-level operation is just the default behavior. This basic behavior should remain the default, of course. However, there already are a lot of options specific to text files that *do* attach character semantics to octets, such as the -n option to number output lines: http://www.hmug.org/man/1/cat.html As a minimum, option -v must know the semantics of the NL and LF control codes, of the digits, and of white space. There is no technical reason for not adding more options to act more sensibly with the encoding(s) of the involved text file(s). Again, any such text-specific option must be disabled by default, in order to preserve the basic byte-by-byte operation. _ Marco
RE: List of ligatures for languages of the Indian subcontinent.
Kenneth Whistler wrote: Dream on. The information needed exists in books and other reference sources in libraries, book shops, and other collections across India -- and, for that matter, around the world. It is merely a matter of collecting the relevant information and distilling it into succinct, yet complete, statements of the relevant information needed for proper typographic practice for each script, for each style of each script, for each local typographic tradition for each style, and so on. A couple of hints for William and other people interested in this issue: - Akira Nakanishi, Writing Systems of the World -- Alphabets, Syllabaries, Pictograms, Tuttle 1980 (1999), ISBN 0804816549. This charming little book explores all the scripts used in the world today, giving for each of them a table of all the signs (apart from Chinese, of course) and an explanation of how the script works. For each script, it also reproduces a page from a daily newspaper written in that script. The information is not always 100% accurate; however, the book remains an invaluable introduction to the scripts of the world, and a perfect complement to the reading of the Unicode Standard. - The grammars in the National Integration Series by Balaji Publications, Madras, India. Each grammar in this series is a small A5-format book bearing a title like: Learn (language name) in 30 Days through English. The grammars are not very sound from the linguistic point of view (it's unlikely that the reader will actually learn an Indian language in one month!), but they all have a very interesting introduction to the script used by each language, which normally includes a table of all the combinations of consonant+vowel, a table of the essential consonant clusters, and the half or subjoined consonants.

If you compare the grammars of languages sharing the same script (such as Sanskrit, Hindi, and Marathi, all written with the Devanagari script), you can see how the list of required ligatures varies from one language to another. Notice that these books too are far from being 100% accurate. All the above books are inexpensive and easily found in bookshops in the UK and elsewhere. Another good source for making a list of required glyphs is the existing non-Unicode fonts for Indic languages. The nicest free collection I have seen so far is the Akruti GNU TrueType fonts, which contain a set of glyphs appropriate for most modern usages: http://www.akruti.com/freedom/ _ Marco
RE: Need encoding conversion routines
askq1 askq1 wrote: From: Pim Blokland [EMAIL PROTECTED] However, you have said this is not what you want! So what is it that you do want? I want C/C++ code that will give me the UTF-8 byte sequence representing a given code point, the UTF-16 16-bit sequence representing a given code point, and the UTF-32 32-bit sequence representing a given code point. e.g. UTF8_Sequence CodePointToUTF8(Unichar codePoint) { //I need this code } UTF16_Sequence CodePointToUTF16(Unichar codePoint) { //I need this code } UCS2_Sequence CodePointToUCS2(Unichar codePoint) { //I need this code } Hint: #include "ConvertUTF.h" typedef UTF32 Unichar; typedef UTF8 UTF8_Sequence[4 + 1]; typedef UTF16 UTF16_Sequence[2 + 1]; typedef UTF16 UCS2_Sequence[1 + 1]; _ Marco
RE: Need encoding conversion routines
askq1 askq1 wrote: I want C/C++ functions/routines that will convert Unicode to UTF-8/UTF-16/UCS-2 encodings and vice versa. Can someone point me to where I can get these code routines? Unicode's reference implementation is here, although I don't know how up to date it is with some recent small changes in UTF-8: http://www.unicode.org/Public/PROGRAMS/CVTUTF/ You can also look at the UTF functions provided by ICU, an open-source Unicode library: http://oss.software.ibm.com/icu/ _ Marco
RE: Encoding: Unicode Quarterly Newsletter
Kenneth Whistler wrote: [...] Of course, further weight corrections need to be applied if reading the standard *below* sea level or in a deep cave. I hope it will not be considered pedantic to observe that the mass or weight of a book does not change depending on whether someone is reading it or not. Consequently, the same weight corrections need to be applied also if someone *throws* the standard into a deep cave. _ Marco