Re: U+0BA3, U+0BA9

2003-10-25 Thread Asmus Freytag
At 02:08 PM 10/25/03 -0700, Doug Ewell wrote:
> So, in effect the UNICODE character names attempt to be
> a unified transliteration scheme for all languages? Are these
> principles laid down somewhere or is this more informal?
The Unicode character names attempt to be (a) unique and (b) reasonably
mnemonic.  Anything beyond that is a bonus.  They expressly do *not*
represent any form of transliteration or transcription scheme.
However, it is sometimes forgotten that the standard is intended to be
in English (with the possibility of translation to other languages,
for example the French translation that has been carried out for 3.2).
If a character has an obvious or common English name, that name should be
used. Where there is no obvious English name, a transliteration or
transcription of the native name makes sense.
In the case of a script used by multiple languages, it's an interesting
question which language wins out. Assume you have a majority language that
doesn't use a certain character, but has a word for it. Does it make more
sense to keep all transcriptions in the same language in Unicode character
names?
Opinions differ.

Ultimately the only strong requirements are that names are unique and
(recently added) that dropping common words such as LETTER, MARK, SIGN
and SYMBOL, as well as spaces and hyphens, does not affect that uniqueness.
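
As a rough sketch of what that loose-matching requirement amounts to
(illustration only: the class and method names are invented, and the
standard itself defines the precise rule):

    public class NameSkeleton {
        // Two names collide under the rule above if they are identical
        // after dropping LETTER/MARK/SIGN/SYMBOL, spaces, and hyphens.
        static String skeleton(String name) {
            return name.toUpperCase()
                       .replaceAll("\\b(LETTER|MARK|SIGN|SYMBOL)\\b", "")
                       .replaceAll("[ -]", "");
        }
        public static void main(String[] args) {
            // Both reduce to "LATINSMALLA", so only one of the two
            // could ever be granted as a character name.
            System.out.println(skeleton("LATIN SMALL LETTER A"));
            System.out.println(skeleton("LATIN SMALL-A"));
        }
    }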
Since the character names freeze mistakes permanently and since the committee
decisions have resulted in some odd and not always consistent approaches to
naming, some of the translated sets of character names are more consistent
and usable than the official English.
That has led to the suggestion of eventually creating a translation of the
character names into e.g. American English, essentially providing a set of
consistent aliases. These might be useful for dictionaries of character names
exposed to end users interested in locating characters, as opposed to merely
wanting the formal, but potentially arbitrary, reference.
A./



RE: CGJ - Combining Class Override

2003-10-25 Thread Jony Rosenne
Sorry, Philippe, I had meant a separate character for a "right Meteg", not a
separate control character. Does this mean we agree?

Jony

> -----Original Message-----
> From: Philippe Verdy [mailto:[EMAIL PROTECTED] 
> Sent: Saturday, October 25, 2003 5:58 PM
> To: Jony Rosenne
> Cc: [EMAIL PROTECTED]
> Subject: Re: CGJ - Combining Class Override
> 
> 
> From: "Jony Rosenne" <[EMAIL PROTECTED]>
> 
> > For the record, I repeat that I am not convinced that the CGJ is an 
> > appropriate solution for the problems associated with the 
> right Meteg. 
> > I tend to think we need a separate character.
> 
> Yes, it's possible to devise another character explicitly to 
> override very precisely the ordering of combining classes. 
> But this still does not change the problem, as all the 
> existing NF* forms in existing documents using any past or 
> present version of Unicode MUST remain in NF* form with 
> further additions.
> 
> If one votes for a separate control character, it should come 
> with precise rules describing how such an override can/must be 
> used, so that we won't break existing implementations. This 
> character will necessarily have a combining class of 0, but will 
> still have a preceding context. Strict conformance for the 
> new NF* forms must still obey the precise ordering rules, 
> and this character, whatever its form, shall not be used 
> when it is not needed, i.e. when the existing
> NF* forms already produce the correct logical order (that's why 
> its use should then be restricted to a list of known 
> combining characters that may need this override).
> 
> Call it CCO, "Combining Class Override"? This does not 
> change the problem: this character should be used only 
> between pairs of combining characters, such as the encoded sequence:
> {c1, CCO, c2}
> shall conform to the rules:
> (1) CC(c1) > CC(c2) > 0,
> (2) c1 is known (listed by Unicode?) to require this override
> to keep the logical ordering needed for correct text semantics.
> 
> The second requirement should be made to avoid abuses of this 
> character. But it is not enforceable if CGJ is kept for this function.
> 
> The CCO character should then be made "ignorable" for
> collation or text breaks, so that collation keys will become:
> [ CK(c1), CK(c2) ]  for {c1, CCO, c2}
> [ CK(c2), CK(c1) ]  for {c2, c1} and {c1, c2} if normalized
> 
> Legacy applications will detect a separate combining sequence 
> starting at CCO, but newer applications will still know that 
> both sequences are describing a single grapheme cluster.
> 
> This knowledge should not be necessary except in grapheme 
> renderers, or in some input methods that will allow users to
> enter:
> (1) keys  producing the normalized text {c2, c1}
>  as before;
> (2) keys  producing the normalized text {c1, CCO, c2}
>  instead of {c2, c1} as before;
> (3) optionally support a keystroke or selection system to swap
>  combining characters.
> 
> If this is too complex, the only way to manage the situation 
> is to duplicate existing combining characters that cause this 
> problem, and I think this may go even worse as this 
> duplication may need to be combinatorial and require a lot of 
> new codepoint assignments.
> 
> 
> 





Re: Traditional dollar sign

2003-10-25 Thread Philippe Verdy
From: "Simon Butcher" <[EMAIL PROTECTED]>

> Hi!
>
> Just a quick question.. The description for U+0024 (DOLLAR SIGN) states
that the glyph may contain one or two vertical bars. Is there a codepoint
specifically for the traditional double-bar form, or any plan to include one
in the future?
>
> I was taught at school that the double-bar form was used when Australia
switched to decimal currency in 1966, and that it was incorrect to write the
single-bar form when referring to Australian dollars. I guess the single-bar
form had taken over due to the lack of support from type-faces and computing
devices, although it's still quite common to see it in Australian
publications, especially in large fonts (headlines, advertising, etc).

There's a similar consideration in French primary schools about the correct
way to draw the decimal digits: the handwritten barred form of digit seven
is mandatory to avoid confusion with the handwritten digit one, and the
"uppercase L with stroke" and "zigzag" forms of digit four are also
prohibited. In school books they are shown correctly, but the rule is
quickly forgotten once children are used to drawing digits that are easy
to differentiate.




Re: Traditional dollar sign

2003-10-25 Thread Philippe Verdy
From: "Peter Kirk" <[EMAIL PROTECTED]>
> I wonder how long before the Euro will also de facto have a single bar?

This has been the case since the birth of the symbol, when some legal texts
specified that (if nothing else) an uppercase letter E could be used in
environments that don't support the exact initial euro symbol design.

And in fact I now see many more variants of the symbol in ads and
other commercial displays, using one of the many forms that have appeared
for it.

And I myself sometimes handwrite it with a single bar, which can look just
like a tall and wide lowercase e in which the single bar touches the
top right corner of a slanted curve, simply because I usually draw the
horizontal stroke before this curve, forgetting to draw the second bar or
too often drawing it on top of the first bar.

If there are effectively semantic differences between the single-bar and
double-bar glyphs for the dollar in Australia, New Zealand or other countries
using this symbol, and the glyph for the US dollar, a variant may be
the best solution to represent them (letting users select a font that makes
this distinction). I bet such cases will be exceptional.




Re: Merging combining classes, was: New contribution N2676

2003-10-25 Thread Philippe Verdy
From: "Peter Kirk" <[EMAIL PROTECTED]>

> I can see that there might be some problems in the changeover phase. But
> these are basically the same problems as are present anyway, and at
> least putting them into a changeover phase means that they go away
> gradually instead of being standardised for ever, or however long
> Unicode is planned to survive for.

I had already thought about it. But this may cause more trouble in the
future for handling languages (like modern Hebrew) in which those combining
classes are not a problem, and where the ordering of combining characters is
a real bonus that would be lost if combining classes were merged, notably
for full-text searches, where the number of order combinations to search
could explode, as the effective order in occurrences would become
unpredictable for searches.

Of course, if the combining class values were really bogus, a much simpler
way would be to deprecate some existing characters, allowing new
applications to use the new replacement characters, and slowly adapt the
existing documents to the replacement characters whose combining classes
would be more language-friendly.

This last solution may seem better, but only in the case where a unique
combining class can be assigned to these characters. As was said previously
on this list, there are languages in which such an axiom will cause problems,
meaning that, with the current model, those problematic combining characters
would have to be encoded with a null combining class and linked to the
previous combining sequence using either a character property (for their
combining behavior in grapheme clusters and for rendering) or a specific
joiner control (ZWJ?) if this property is not universal for the character.

> It isn't a problem for XML etc as in such cases normalisation is
> recommended but not required, thankfully.

In practice, "recommended" will mean that many processes will perform this
normalization as part of their internal job, so it would cause
interoperability problems if the result of this normalization is later
retrieved by an unaware client that submitted the data to a service
which is supposed to preserve the normalization identity of the string.

Also, I have doubts about the validity of this change with respect to the
stability pact signed between Unicode and the W3C for XML.

> As for requirements that lists
> are normalised and sorted, I would consider that a process that makes
> assumptions, without checking, about data received from another process
> under separate control is a process badly implemented and asking for
> trouble.

Here the problem is that we will not always have to manage the case of
separate processes, but also the case of utility libraries: if such a library
is upgraded separately, the application using it may start experiencing
problems. E.g., I am thinking about the implied sort order in SQL databases
for table indices: what would happen if the SQL server is stopped just long
enough to upgrade a standard library implementing normalization among many
other services, because a security bug such as a buffer overrun was fixed
in another API? When restarting the SQL server with the new library
implementing the new normalization, nothing would happen, apparently, but
the sort order would no longer be guaranteed, and stored sorted indices would
start being "corrupted", in a way that would invalidate binary searches
(meaning that some unique keys could become duplicated, or not found,
producing unpredictable results, critical if they are relied on for, say,
user authentication, or file existence).

Of course such an upgrade should be documented, but it would occur at a very
intimate level of a utility library incidentally used by the server. Will
all administrators and programmers be able to find and know all the intimate
details of this change, when Unicode has stated to them that normalized
forms should never change? Will it be possible to scan and rebuild the
corrupted data with a check-and-repair tool, if the programmers of this
system assumed that the Unicode statement was definitive and allowed such
assumptions to be built into optimized systems?
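
A toy illustration of that failure mode, with the comparators standing in
for the old and new normalization-dependent sort orders (all names here
are invented for the example):

    import java.util.*;

    public class StaleIndex {
        public static void main(String[] args) {
            // A persisted "index", sorted under the old comparator
            // (case-insensitive order stands in for the old normalization).
            List<String> index = Arrays.asList("a", "B", "c", "D");

            // After the upgrade, lookups compare differently (plain
            // code-point order stands in for the new normalization).
            Comparator<String> newOrder = Comparator.naturalOrder();

            // The stored data is not sorted under newOrder, so the
            // binary-search contract is void: "D" is present but is
            // reported missing (a negative insertion point, here -3).
            System.out.println(Collections.binarySearch(index, "D", newOrder));
        }
    }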

When I read the stability pact, I can conclude from it that any text valid
and normalized in one version of Unicode will remain normalized in any
version of Unicode (including previous ones) provided that the normalized
strings contain characters that were all defined in the previous version.
This means that there's an upward _and_ backward compatibility of encoded
strings and normalizations on their common defined subset (excluding only
characters that have been defined in later versions but were not assigned in
previous versions).

The only thing that is allowed to change is the absolute value of non-zero
combining classes (but in a limited way, as for now they are limited to an
8-bit value range also specified in the current stability pact with the XML
working group), but not their relative order: merging neighbouring classes
w

RE: Traditional dollar sign

2003-10-25 Thread Asmus Freytag
At 11:02 AM 10/26/03 +1100, Simon Butcher wrote:

> Hi!
>
> >> I was taught at school that the double-bar form was used when Australia
> >> switched to decimal currency in 1966, and that it was incorrect to write
> >> the single-bar form when referring to Australian dollars.
> >
> > It would be interesting if you could document that.
>
> That could be tough :) Literature shown to me was at school (many years
> ago), and digging it up would be difficult. It's widely known that the
> double-bar form does exist, though, at least!

But we knew that.

> >> I guess the single-bar form had taken over due to the lack of support
> >> from type-faces and computing devices, although it's still quite common
> >> to see it in Australian publications, especially in large fonts
> >> (headlines, advertising, etc).
> >
> > It looks like actual practice is what you describe: the free alternation
> > between the forms without change in meaning.
> >
> > If we were to add a code point we would get into the situation that the
> > free alternation would suddenly become a matter of content difference
> > (not just a choice in presentation). In other cases where the majority
> > of users freely alternate, but there is indication that some subset of
> > users need to maintain a form distinction we have used standardized
> > variants. This has been done mostly for mathematical symbols.
>
> I understand, although couldn't that same argument be used against many
> of the characters in the 'Dingbats' section, such as the ornamental
> variations of exclamation marks, quotation marks, and so forth? I do
> realise these come from an existing character set, but there are indeed
> still users of the double-bar form. Even my Concise Oxford Dictionary is
> printed using the double-bar form (under the term, 'dollar').

If their font uses that other shape, that's what they get. Only when the
distinction is required (as demonstrated in actual use, not just what you
get taught in school) should we disunify.

> I just thought it extremely odd that a character which is still in
> common (albeit admittedly waning) use is not included in the set. Peter
> Kirk made a valid observation with regards to the Lira symbol (U+20A4)
> which Unicode admits often has U+00A3 (Pound sign) used in its place,
> with the only difference being a double-bar on U+20A4.

I've never seen a widely used font with both symbols in it. That alone
suggests that the unification is correct. For the case of the Lira, I plead
ignorance on the specific justification (and whether I would have considered
it important). The fact is that the source for it is buried in the early
drafts of Unicode, probably predating my involvement - so the only thing I
can note is that TUS 4.0 points out that 00A3 should be used (i.e. suggests
a de facto unification in recommended use).

A./



RE: Traditional dollar sign

2003-10-25 Thread Simon Butcher

Hi!


> >I was taught at school that the double-bar form was used 
> when Australia 
> >switched to decimal currency in 1966, and that it was 
> incorrect to write 
> >the single-bar form when referring to Australian dollars.
> 
> It would be interesting if you could document that.

That could be tough :) Literature shown to me was at school (many years
ago), and digging it up would be difficult. It's widely known that the
double-bar form does exist, though, at least!

> >I guess the single-bar form had taken over due to the lack 
> of support from 
> >type-faces and computing devices, although it's still quite 
> common to see 
> >it in Australian publications, especially in large fonts (headlines, 
> >advertising, etc).
> 
> It looks like actual practice is what you describe: the free 
> alternation 
> between the forms without change in meaning.
> 
> If we were to add a code point we would get into the 
> situation that the 
> free alternation would suddenly become a matter of content 
> difference (not 
> just a choice in presentation). In other cases where the 
> majority of users 
> freely alternate, but there is indication that some subset of 
> users need to 
> maintain a form distinction we have used standardized 
> variants. This has 
> been done mostly for mathematical symbols.


I understand, although couldn't that same argument be used against many
of the characters in the 'Dingbats' section, such as the ornamental
variations of exclamation marks, quotation marks, and so forth? I do
realise these come from an existing character set, but there are indeed
still users of the double-bar form. Even my Concise Oxford Dictionary is
printed using the double-bar form (under the term, 'dollar').

I just thought it extremely odd that a character which is still in
common (albeit admittedly waning) use is not included in the set. Peter
Kirk made a valid observation with regards to the Lira symbol (U+20A4)
which Unicode admits often has U+00A3 (Pound sign) used in its place,
with the only difference being a double-bar on U+20A4.

Cheers,

 - Simon




Re: Unicode and Script Encoding Initiative in San Jose Mercury News

2003-10-25 Thread Eric Muller


Doug Ewell wrote:

> [...] about "You see, boys and girls, computers think only in numbers"
> -- in a Silicon Valley paper, [...]
>
> [...] Should we tell them about “real” quotes?

“Real quotes” are not just for Web publication; they are also for email.
Throw in real dashes, of the kind – en or em – you prefer.

Eric.
8-)




Re: New contribution N2676

2003-10-25 Thread Raymond Mercier
> Should we continue to encode this as ARTABE SIGN and just note the use of
> this shape for 'zero' in an annotation?
> Should we change it to another name and add the annotation for 'artabe'?
> Should we take any other actions?



Well, I don't quite know. My real interest is in the changing shape of the
zero, but I am not yet ready with a proposal.

Besides, in the papyri where Kenyon read Artabe this symbol is much of the
time coupled with another, the two written rather cursively together in the
papyri. Kenyon carefully records all the different forms, and after seeing
that I am in some doubt about what exactly should be encoded. I suspect that
the new list is based not on the many, many symbols given by Kenyon in his
many volumes of transcribed papyri, but on a summary list that he published
before that.
I wish I could be more definite.

Raymond


----- Original Message -----
From: "Asmus Freytag" <[EMAIL PROTECTED]>
To: "Raymond Mercier" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, October 25, 2003 8:26 PM
Subject: Re: New contribution N2676


>
> At 05:51 PM 10/25/03 +0100, Raymond Mercier wrote:
> >  Among the new characters in N2676 there is
> >
> >  10186 G GREEK ARTABE SIGN
> >
> >  This is one of the many signs found in papyri, such as those edited by
> >Kenyon. This symbol represents apparently a measure of volume used for
> >grain. It appears as a small circle, smaller than omicron, with a long
> >overline, much longer than a macron.
> >
> >  While I have been looking for the various forms of the symbol for zero
> > I find in other papyri quite exactly the same character used for 'zero'.
> > I make this comparison after studying many photographs of papyri, those
> > provided with Kenyon's editions on the one hand, and on the other,
> > Alexander Jones' recent volume of horoscopes, Astronomical Papyri from
> > Oxyrhynchus.
> >  The attached image is taken from Jones, part of a column of zeroes
> > written this way.
>
> This is fascinating information.
>
> However, I'm unclear what you propose.
>
> Should we continue to encode this as ARTABE SIGN and just note the use of
> this shape for 'zero' in an annotation?
>
> Should we change it to another name and add the annotation for 'artabe'?
>
> Should we take any other actions?
>
> A./




Re: U+0BA3, U+0BA9

2003-10-25 Thread Peter Kirk
On 25/10/2003 14:08, Doug Ewell wrote:

> Peter Jacobi wrote:
>
>> So, in effect the UNICODE character names attempt to be
>> a unified transliteration scheme for all languages? Are these
>> principles laid down somewhere or is this more informal?
>
> The Unicode character names attempt to be (a) unique and (b) reasonably
> mnemonic.  Anything beyond that is a bonus.  They expressly do *not*
> represent any form of transliteration or transcription scheme.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/

If you think the Tamil is misleading, look at the Cyrillic. The same 
sound is written as I in 0415, Y in 042E and J in 0408.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Unicode and Script Encoding Initiative in San Jose Mercury News

2003-10-25 Thread Doug Ewell
Deborah W. Anderson  wrote:

> The Business section in today's San Jose Mercury News (Friday, Oct.
> 24) has a story on Unicode and the Script Encoding Initiative:
> http://www.bayarea.com/mld/mercurynews/business/7092371.htm

Nice article.  Good to see some mainstream publicity for this worthy
effort.

My eyes rolled waaay up when I got to the part about "You see, boys and
girls, computers think only in numbers" -- in a Silicon Valley paper,
yet!  But I guess this did appear in the Business section, not the
Technology section.

On the typographical dark side, it was quite discouraging to see ``this
horrible quoting convention'' in a Web publication of an article about
Unicode.  Should we tell them about “real” quotes?

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: transliteration in java

2003-10-25 Thread Mark Davis



Check out ICU4J (http://oss.software.ibm.com/icu4j/).
There is a demo of transliteration at http://oss.software.ibm.com/cgi-bin/icu/tr.
For Cyrillic, we currently only do an ISO-based transliteration, but you can
do your own custom ones.

(The demo will store custom rules that people have devised. I see that there
are a couple of Cyrillic ones, as well as a number of ones we don't have in
the stock ICU, such as American/Canadian Indian transliterators.)

Mark
__
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

  - Original Message -
  From: Dennis N. Stetsenko
  To: [EMAIL PROTECTED]
  Sent: Sat, 2003 Oct 25 11:25
  Subject: transliteration in java

  Hello

  My apologies if such kind of question is too silly, but I browsed quickly
  through resources\FAQ and did not find anything useful for me…

  I have a bunch of files that are in a Cyrillic charset and I need to
  transfer them to some device that is not capable of showing such a
  charset (it doesn't have an appropriate font).

  So, I've decided to provide a transliteration mechanism, i.e. convert
  chars from Cyrillic to Latin. The language that I'm going to use is Java.
  Can you guys point me to some useful resource for doing so, or give me
  some recommendations?
  =
  I've made some preliminary prototyping, and the results appear to be
  weird.
  1. I provide a mapping from a char (let's say Cyrillic) to its Latin
  equivalent in the sense of transliteration.
  2. I take the flat file and process it (convert from Cyrillic to Latin).

  Sometimes it's working, sometimes it's not…
  Apparently when I run simple things from my IDE it works fine, but when
  I'm trying to do the same in standalone mode it skips processing.
  I was hunting down the problem and this is the difference I see:

  When I make a call like Character.UnicodeBlock.of(toProcess) for the
  next char to transliterate, it shows
  From IDE - CYRILLIC
  Standalone - LATIN_1_SUPPLEMENT

  So, I guess the way the flat file is read makes a big difference… I'm
  willing to blame some difference in system properties settings for such
  calls…
  Can you help me with pointers to make it the way it should be?

  Thanks,
  Dennis
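
A minimal sketch of the ICU4J calls mentioned above; the stock
"Cyrillic-Latin" ID ships with ICU, while the two rules in the custom
example are an invented fragment, not a complete scheme:

    import com.ibm.icu.text.Transliterator;

    public class Translit {
        public static void main(String[] args) {
            // The stock ISO-based transliterator.
            Transliterator t = Transliterator.getInstance("Cyrillic-Latin");
            System.out.println(t.transliterate("\u0414\u0435\u043d\u0438\u0441"));

            // A custom transliterator built from rules.
            Transliterator custom = Transliterator.createFromRules(
                    "Custom-CyrLatin", "\u0434 > d; \u0436 > zh;",
                    Transliterator.FORWARD);
            System.out.println(custom.transliterate("\u0434\u0436")); // "dzh"
        }
    }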


Re: U+0BA3, U+0BA9

2003-10-25 Thread Doug Ewell
Peter Jacobi  wrote:

> So, in effect the UNICODE character names attempt to be
> a unified transliteration scheme for all languages? Are these
> principles laid down somewhere or is this more informal?

The Unicode character names attempt to be (a) unique and (b) reasonably
mnemonic.  Anything beyond that is a bonus.  They expressly do *not*
represent any form of transliteration or transcription scheme.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: New contribution N2676

2003-10-25 Thread Asmus Freytag
At 05:51 PM 10/25/03 +0100, Raymond Mercier wrote:

> Among the new characters in N2676 there is
>
> 10186 G GREEK ARTABE SIGN
>
> This is one of the many signs found in papyri, such as those edited by
> Kenyon. This symbol apparently represents a measure of volume used for
> grain. It appears as a small circle, smaller than omicron, with a long
> overline, much longer than a macron.
> While I have been looking for the various forms of the symbol for zero I
> find in other papyri quite exactly the same character used for 'zero'. I
> make this comparison after studying many photographs of papyri, those
> provided with Kenyon's editions on the one hand, and on the other,
> Alexander Jones' recent volume of horoscopes, Astronomical Papyri from
> Oxyrhynchus.
> The attached image is taken from Jones, part of a column of zeroes written
> this way.

This is fascinating information.

However, I'm unclear what you propose.

Should we continue to encode this as ARTABE SIGN and just note the use of 
this shape for 'zero' in an annotation?

Should we change it to another name and add the annotation for 'artabe'?

Should we take any other actions?

A./



transliteration in java

2003-10-25 Thread Dennis N. Stetsenko

Hello

My apologies if such kind of question is too silly, but I browsed quickly
through resources\FAQ and did not find anything useful for me…

I have a bunch of files that are in a Cyrillic charset and I need to
transfer them to some device that is not capable of showing such a charset
(it doesn't have an appropriate font).

So, I've decided to provide a transliteration mechanism, i.e. convert chars
from Cyrillic to Latin. The language that I'm going to use is Java.
Can you guys point me to some useful resource for doing so, or give me some
recommendations?
=
I've made some preliminary prototyping, and the results appear to be weird.
1. I provide a mapping from a char (let's say Cyrillic) to its Latin
equivalent in the sense of transliteration.
2. I take the flat file and process it (convert from Cyrillic to Latin).

Sometimes it's working, sometimes it's not…
Apparently when I run simple things from my IDE it works fine, but when I'm
trying to do the same in standalone mode it skips processing.
I was hunting down the problem and this is the difference I see:

When I make a call like Character.UnicodeBlock.of(toProcess) for the next
char to transliterate, it shows
From IDE - CYRILLIC
Standalone - LATIN_1_SUPPLEMENT

So, I guess the way the flat file is read makes a big difference… I'm
willing to blame some difference in system properties settings for such
calls…
Can you help me with pointers to make it the way it should be?

Thanks, Dennis
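
The IDE-versus-standalone difference described above is usually the
platform default charset applied while reading the file. A sketch of
reading with an explicit charset; "windows-1251" is only a guess at what
the flat files are actually in:

    import java.io.*;

    public class ReadCyrillic {
        public static void main(String[] args) throws IOException {
            // Never rely on the default charset (which is what FileReader
            // uses); name the file's real encoding explicitly.
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(args[0]), "windows-1251"));
            String line;
            while ((line = in.readLine()) != null) {
                // Characters now arrive as genuine CYRILLIC code points,
                // not LATIN_1_SUPPLEMENT mojibake, so a char-to-char
                // transliteration mapping can apply.
                System.out.println(line);
            }
            in.close();
        }
    }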








Re: New contribution N2676

2003-10-25 Thread Michael Everson
At 02:29 +0200 2003-10-25, Philippe Verdy wrote:

> 0659 ARABIC ZWARAKAY . Pashto
> Why not ARABIC MACRON? Well, Zwarakay may be appropriate if this is the
> transliterated Arabic name.

It isn't a macron. It's a zwarakay, and that's a Pashto name.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Traditional dollar sign

2003-10-25 Thread Peter Kirk
On 25/10/2003 10:16, Asmus Freytag wrote:

> At 03:36 AM 10/26/03 +1100, Simon Butcher wrote:
>
>> Just a quick question.. The description for U+0024 (DOLLAR SIGN)
>> states that the glyph may contain one or two vertical bars. Is there
>> a codepoint specifically for the traditional double-bar form, or any
>> plan to include one in the future?
>
> No.
>
>> I was taught at school that the double-bar form was used when
>> Australia switched to decimal currency in 1966, and that it was
>> incorrect to write the single-bar form when referring to Australian
>> dollars.
>
> It would be interesting if you could document that.
>
>> I guess the single-bar form had taken over due to the lack of support
>> from type-faces and computing devices, although it's still quite
>> common to see it in Australian publications, especially in large
>> fonts (headlines, advertising, etc).
>
> It looks like actual practice is what you describe: the free
> alternation between the forms without change in meaning.
>
> If we were to add a code point we would get into the situation that
> the free alternation would suddenly become a matter of content
> difference (not just a choice in presentation). In other cases where
> the majority of users freely alternate, but there is indication that
> some subset of users need to maintain a form distinction we have used
> standardized variants. This has been done mostly for mathematical
> symbols.
>
> In theory, this could be done here as well, but any thoughts in that
> direction would need to be preceded by clear and compelling evidence
> of an actual requirement. The case of an official preference that has
> never been widely adhered to -- which is what you have described --
> would probably not qualify as grounds for taking any action.
>
> A./




The situation seems very similar to that for U+20A4 vs. U+00A3. I was 
taught at school in the UK, and I guess Australians were taught before 
1966, to write the pound sign with two bars like U+20A4, and in fact I 
still usually do so in handwriting. But today the single-barred version 
is much more common in print in the UK. And the notes for U+20A4 suggest 
that this became true also in Italy, before the Euro was introduced.

I wonder how long before the Euro will also de facto have a single bar?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Merging combining classes, was: New contribution N2676

2003-10-25 Thread Stefan Persson
Philippe Verdy wrote:

> The problem with this solution is that stability is not guaranteed across
> backward versions of Unicode: if a tool A implements the new version of
> combining classes and normalizes its input, it will keep the relative
> ordering of characters. If its output is injected into a tool B that
> still uses the legacy classes, the tool B may either reject the input
> (not normalized) or force the normalization. Then if the text comes back
> to tool A, it will see a modified text.
Wouldn't it be possible, if this is of any importance in a specific 
situation, to specify a Unicode version, and not utilise additional 
normalisation data that is only specified in versions later than the 
specified one?  For example,

  x = normalise("some text", 4.0);

normalises the text according to the rules specified in Unicode 4.0, or, 
if the software has not yet been updated with this information, 
according to the rules in an earlier version of Unicode, while

  x = normalise("some text");

would normalise the text according to the most recent version of Unicode 
for which the "normalise" program has any data.

Stefan
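
No normalizer actually takes a version argument like that, but the idea
can be approximated with character age. A sketch using ICU4J (the class
and method names are invented); note it only guards against characters
unassigned in the pinned version, not against combining-class values that
changed between versions, so it approximates the idea rather than
implementing it:

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.text.Normalizer;
    import com.ibm.icu.util.VersionInfo;

    public class PinnedNormalise {
        // Normalize only if every code point already existed in the
        // pinned Unicode version; otherwise hand the text back
        // untouched so an older peer sees exactly what it sent.
        static String normalise(String s, VersionInfo pinned) {
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                if (UCharacter.getAge(cp).compareTo(pinned) > 0) return s;
                i += Character.charCount(cp);
            }
            return Normalizer.normalize(s, Normalizer.NFC);
        }

        public static void main(String[] args) {
            System.out.println(normalise("some text", VersionInfo.UNICODE_4_0));
        }
    }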




Re: New contribution N2676

2003-10-25 Thread Raymond Mercier
 Among the new characters in N2676 there is

 10186 G GREEK ARTABE SIGN

 This is one of the many signs found in papyri, such as those edited by
Kenyon. This symbol represents apparently a measure of volume used for
grain. It appears as a small circle, smaller than omicron, with a long
overline, much longer than a macron.

 While I have been looking for the various forms of the symbol for zero I
find in other papyri quite exactly the same character used for 'zero'. I make
this comparison after studying many photographs of papyri, those provided
with Kenyon's editions on the one hand, and on the other, Alexander Jones'
recent volume of horoscopes, Astronomical Papyri from Oxyrhynchus.
 The attached image is taken from Jones, part of a column of zeroes written
this way.

 Raymond Mercier

> - Original Message - 
> From: "Michael Everson" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Friday, October 24, 2003 7:36 PM
> Subject: New contribution N2676
>
>
> >
> > A new contribution:
> > N2676
> > Repertoire additions from meeting 44
> > Asmus Freytag
> > 2003-10-23
> > http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2676.pdf
> >
> > -- 
> > Michael Everson * * Everson Typography *  * http://www.evertype.com
>

Re: Traditional dollar sign

2003-10-25 Thread Asmus Freytag
At 03:36 AM 10/26/03 +1100, Simon Butcher wrote:

> Just a quick question.. The description for U+0024 (DOLLAR SIGN) states
> that the glyph may contain one or two vertical bars. Is there a codepoint
> specifically for the traditional double-bar form, or any plan to include
> one in the future?

No.

> I was taught at school that the double-bar form was used when Australia
> switched to decimal currency in 1966, and that it was incorrect to write
> the single-bar form when referring to Australian dollars.

It would be interesting if you could document that.

> I guess the single-bar form had taken over due to the lack of support from
> type-faces and computing devices, although it's still quite common to see
> it in Australian publications, especially in large fonts (headlines,
> advertising, etc).

It looks like actual practice is what you describe: the free alternation 
between the forms without change in meaning.

If we were to add a code point we would get into the situation that the 
free alternation would suddenly become a matter of content difference (not 
just a choice in presentation). In other cases where the majority of users 
freely alternate, but there is indication that some subset of users need to 
maintain a form distinction we have used standardized variants. This has 
been done mostly for mathematical symbols.

In theory, this could be done here as well, but any thoughts in that 
direction would need to be preceded by clear and compelling evidence of an 
actual requirement. The case of an official preference that has never been 
widely adhered to -- which is what you have described -- would probably not 
qualify as grounds for taking any action.

A./



Re: Merging combining classes, was: New contribution N2676

2003-10-25 Thread Peter Kirk
On 25/10/2003 09:11, Philippe Verdy wrote:

> From: "Peter Kirk" <[EMAIL PROTECTED]>
>
> ...
>
> The problem would then be the interoperability of Unicode-compliant
> systems using distinct versions of Unicode (for example between
> XML processors, text editors, input methods, renderers, text
> converters, full text search engines). This may even be critical in
> tools like sorting, in applications that require and expect that their
> input is sorted according to its locale in a predictable way (for
> example in applications using binary searches in sorted lists of
> text items, such as authentication in a list of user names, or
> a filenames index).

I can see that there might be some problems in the changeover phase. But 
these are basically the same problems as are present anyway, and at 
least putting them into a changeover phase means that they go away 
gradually instead of being standardised for ever, or however long 
Unicode is planned to survive for.

It isn't a problem for XML etc as in such cases normalisation is 
recommended but not required, thankfully. As for requirements that lists 
are normalised and sorted, I would consider that a process that makes 
assumptions, without checking, about data received from another process 
under separate control is a process badly implemented and asking for 
trouble.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Traditional dollar sign

2003-10-25 Thread Simon Butcher

Hi!

Just a quick question.. The description for U+0024 (DOLLAR SIGN) states that the glyph 
may contain one or two vertical bars. Is there a codepoint specifically for the 
traditional double-bar form, or any plan to include one in the future?

I was taught at school that the double-bar form was used when Australia switched to 
decimal currency in 1966, and that it was incorrect to write the single-bar form when 
referring to Australian dollars. I guess the single-bar form had taken over due to the 
lack of support from type-faces and computing devices, although it's still quite 
common to see it in Australian publications, especially in large fonts (headlines, 
advertising, etc).

Cheers!

 - Simon




Re: Merging combining classes, was: New contribution N2676

2003-10-25 Thread Philippe Verdy
From: "Peter Kirk" <[EMAIL PROTECTED]>
> I wonder if it would in fact be possible to merge certain adjacent
> combining classes, as from a future numbered version N of the standard.
> That would not affect the normalisation of existing text; text
> normalised before version N would remain normalised in version N and
> later, although not vice versa. I know that this would break the letter
> of the current stability policy, but is this kind of backward
> compatibility actually necessary? The change could be sold to others as
> required for the internal consistency of Unicode.

The problem with this solution is that stability is not guaranteed across
backward versions of Unicode: if a tool A implements the new version of
combining classes and normalizes its input, it will keep the relative
ordering of characters. If its output is injected into a tool B that still
uses the legacy classes, the tool B may either reject the input (not
normalized) or force the normalization. Then if the text comes back to
tool A, it will see a modified text.

One could argue that a CCO control may be generated when converting
for backwards versions of Unicode. But will tool A know the version of
Unicode used by legacy tool B, if B is a remote service that does not
provide this version information to A?

The problem would then be the interoperability of Unicode-compliant
systems using distinct versions of Unicode (for example between
XML processors, text editors, input methods, renderers, text
converters, full text search engines). This may even be critical in
tools like sorting, in applications that require and expect that their
input is sorted according to its locale in a predictable way (for
example in applications using binary searches in sorted lists of
text items, such as authentication in a list of user names, or
a filenames index).




Re: CGJ - Combining Class Override

2003-10-25 Thread Philippe Verdy
From: "Jony Rosenne" <[EMAIL PROTECTED]>

> For the record, I repeat that I am not convinced that the CGJ is an
> appropriate solution for the problems associated with the right Meteg. I
> tend to think we need a separate character.

Yes, it's possible to devise another character explicitly to override
very precisely the ordering of combining classes. But this still
does not change the problem, as all the existing NF* forms in
existing documents using any past or present version of Unicode
MUST remain in NF* form with further additions.

If one votes for a separate control character, it should come with
precise rules describing how such an override can/must be used, so
that we won't break existing implementations. This character will
necessarily have a combining class of 0, but will still have a preceding
context. Strict conformance for the new NF* forms must still obey
the precise ordering rules, and this character, whatever its form,
shall not be used when it is not needed, i.e. when the existing
NF* forms already produce the correct logical order (that's why its
use should then be restricted to a list of known combining
characters that may need this override).

Call it CCO, "Combining Class Override"? This does not change
the problem: this character should be used only between pairs
of combining characters, such as the encoded sequence:
{c1, CCO, c2}
shall conform to the rules:
(1) CC(c1) > CC(c2) > 0,
(2) c1 is known (listed by Unicode?) to require this override
to keep the logical ordering needed for correct text semantics.

The second requirement should be made to avoid abuses of this
character. But it is not enforceable if CGJ is kept for this function.

The CCO character should then be made "ignorable" for
collation or text breaks, so that collation keys will become:
[ CK(c1), CK(c2) ]  for {c1, CCO, c2}
[ CK(c2), CK(c1) ]  for {c2, c1} and {c1, c2} if normalized

Legacy applications will detect a separate combining sequence
starting at CCO, but newer applications will still know that both
sequences are describing a single grapheme cluster.

This knowledge should not be necessary except in grapheme
renderers, or in some input methods that will allow users to
enter:
(1) keys  producing the normalized text {c2, c1}
 as before;
(2) keys  producing the normalized text {c1, CCO, c2}
 instead of {c2, c1} as before;
(3) optionally support a keystroke or selection system to swap
 combining characters.

If this is too complex, the only way to manage the situation is
to duplicate existing combining characters that cause this problem,
and I think this may go even worse as this duplication may need
to be combinatorial and require a lot of new codepoint assignments.
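
Setting the proposed CCO aside, the mechanics at issue are easy to see
with the existing CGJ: canonical reordering swaps adjacent marks whose
combining classes descend, and a class-0 character between them blocks
the swap. A small demo, assuming c1 = Hebrew meteg (ccc 22) and
c2 = patah (ccc 17); java.text.Normalizer is shown, ICU4J behaves the same:

    import java.text.Normalizer;

    public class ReorderDemo {
        public static void main(String[] args) {
            String base = "\u05D0";                     // alef
            String meteg = "\u05BD", patah = "\u05B7";  // ccc 22, ccc 17
            String cgj = "\u034F";                      // ccc 0

            // Descending classes are swapped by canonical ordering...
            show(base + meteg + patah);       // -> 05D0 05B7 05BD
            // ...but a ccc=0 character in between blocks the swap.
            show(base + meteg + cgj + patah); // -> 05D0 05BD 034F 05B7
        }
        static void show(String s) {
            String n = Normalizer.normalize(s, Normalizer.Form.NFC);
            for (int i = 0; i < n.length(); i++)
                System.out.printf("U+%04X ", (int) n.charAt(i));
            System.out.println();
        }
    }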




CGJ

2003-10-25 Thread Jony Rosenne
For the record, I repeat that I am not convinced that the CGJ is an
appropriate solution for the problems associated with the right Meteg. I
tend to think we need a separate character.

Jony

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Philippe Verdy
> Sent: Saturday, October 25, 2003 1:12 PM
> To: Peter Kirk
> Cc: [EMAIL PROTECTED]
> Subject: Re: New contribution N2676
> 
> 
> From: "Peter Kirk" <[EMAIL PROTECTED]>
> > Have combining classes actually been defined for these characters?
> >
> > This is of course exactly the same problem as with Hebrew 
> vowel points 
> > and accents, except that this time it applies to real living 
> > languages. Perhaps it is time to do something about these combining 
> > classes which conflict with the standard.
> 
> Do you mean officially documenting the correct (and strict) 
> use of CGJ as the only way to bypass the default order 
> required by the combining classes in normalized forms? It 
> would be a good idea to document officially which use of CGJ 
> is superfluous and should be avoided in NF forms, and which 
> use is required.
> 
> 1) This will affect only the input methods for those 
> languages that need to "swap" the standard order of combining 
> characters to keep their logical order (all this will require 
> is an additional input control that will try swapping 
> ambiguous orders).
> 
> 2) A complete documentation may need to specify which pairs 
> of combining characters are affected (this should list the 
> pairs of combining characters <c1, c2> where CC(c1) > CC(c2) 
> and that require to be encoded <c1, CGJ, c2> to be kept in 
> logical order, as the sequence <c1, c2> will be reordered 
> into <c2, c1> in normalized forms).
> 
> 3) The other issue would be that there may exist other 
> combining characters than those in this pair. Suppose I want 
> to represent <c1, c2, c3>, where CC(c1) > CC(c2), but 
> c3 does not have a conflicting pair in the previous list. 
> Should it be encoded as <c1, CGJ, c2, c3> or as 
> <c1, c3, CGJ, c2>? As the standard normalization algorithm 
> cannot be changed, both sequences will be possible with the 
> NF forms, even though they represent the same character.
> 
> One could design an extra normalization step to force one 
> interpretation (so that only combining characters with 
> conflicting combining classes that have been forced "swapped" 
> will appear after CGJ, all other diacritics being encoded 
> preferably in the first sequence before the CGJ).
> 
> This extra step should not be part of the NF forms (because 
> Unicode states that normalized forms will be kept normalized 
> in all further versions of Unicode), but this could be named 
> differently, by describing a system in which extra 
> normalization steps may be applied that may change NF forms 
> into other "equivalent" sequences also in normalized form.
> 
> 
> 
> 





Re: unicode on Linux

2003-10-25 Thread Stefan Persson
Jungshik Shin wrote:

>> the applications do not expect UTF-8, for instance
>> ls sorts alphabetically but does not know Unicode sorting).
>
> Does 'ls' sort filenames when they're in ISO-8859-1?

My "ls", using the sv_SE.ISO-8859-1 locale, properly sorts file names 
alphabetically.

Stefan
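
"Alphabetically" in the user's sense needs locale-aware collation rather
than raw code-point order; in Java terms the difference looks roughly like
this (the Swedish locale and sample words are only an illustration):

    import java.text.Collator;
    import java.util.*;

    public class SortNames {
        public static void main(String[] args) {
            List<String> names = new ArrayList<String>(
                    Arrays.asList("alg", "Anka", "\u00E4rta")); // ärta

            // Code-point order: every capital sorts before every
            // lowercase letter -> [Anka, alg, ärta].
            Collections.sort(names);
            System.out.println(names);

            // Locale-aware order, the moral equivalent of an ls that
            // honored LC_COLLATE -> [alg, Anka, ärta].
            Collections.sort(names,
                    Collator.getInstance(new Locale("sv", "SE")));
            System.out.println(names);
        }
    }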




Re: unicode on Linux

2003-10-25 Thread Jungshik Shin
Stephane Bortzmeyer wrote:

> Kernel
> 1) File names in Unicode: no (well, the Linux kernel is 8-bits clean
> so you can always encode in UTF-8, but the kernel does not do any
> normalization)

  As other have written, I don't think kernel has any business with
normalization (although on Mac OS X, apparently the kernel does).

> the applications do not expect UTF-8, for instance
> ls sorts alphabetically but does not know Unicode sorting).

  Does 'ls'  sort  filenames when they're in ISO-8859-1?

> 2) User names: worse since utilities to create an account refuses
> UTF-8.

  Yeah, this should be fixed.

> Applications
>
> 3) grep: no Unicode regexp

  I agree that grep and many other text utilities need to be updated to
honor the locale (LC_COLLATE, LC_CTYPE and others). With glibc 2.2.x or
later and gnulib, it shouldn't be as hard as before. In addition, you
always have perl and python to turn to (both support Unicode very well).
 Also note that I wrote about 'honoring the locale' instead of
supporting UTF-8, by which I want to emphasize that it's not just UTF-8
but also legacy character encodings that are not supported by grep and
other GNU textutils used on Linux.

> 4) xterm (or similar virtual terminals): No BiDi support at all

   mlterm does. It even supports Indic scripts (xterm supports Thai
script and Korean script, though). Do you have any terminal emulator
running on other platforms that does BiDi well?

> 5) shells: I'm not aware of any line-editing shell (zsh, tcsh)
> that have Unicode character semantics (back-character should move one
> character, not one byte)

  A recent version of bash (to be precise, the GNU readline library it
uses) has no problem with UTF-8 handling (although it does not do well
with combining character sequences; that is, it doesn't have a notion of
grapheme clusters).

> 6) databases: I'm not aware of a free DBMS which has support for
> Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).

   Why is the OS to blame that there's no FREE DBMS that supports
Unicode collation and regular expressions?  Needless to say, there
are commercial DBMSs that do both and run on Linux.


> 7) Serious word processing: LaTeX has only very minimum Unicode

  Well, Linux distributions come not only with LaTeX/TeX but also with
Lambda/Omega, their Unicode cousins. OpenType font support in
Omega/Lambda is not there yet, but Indic scripts and other complex
scripts (e.g. the Korean script) can be typeset with Omega/Lambda. Anyway,
LaTeX/Lambda are not for word processing. If you want a word processor,
you have to try OpenOffice/StarOffice, AbiWord, KWrite, and so forth,
which support Unicode well.

> Also, many applications (exmh, emacs) are ten times slower when
> running in UTF-8 mode.

  Emacs' adoption of Unicode has been moving frustratingly slowly and the
performance may be slower in UTF-8 mode than otherwise (actually, there
are a couple of different UTF-8 implementations for Emacs and I don't
know which one you tried), but Vim is not. The reason Emacs is that much
slower likely has to do with the fact that UTF-8 support is retrofitted
onto the ISO-2022-based infrastructure of MULE. Other applications on
Linux do NOT have to carry that baggage, so they are not any slower in
UTF-8 mode than in a legacy encoding. Actually, they should be faster in
UTF-8 because most modern toolkits/applications for Linux are based on
Unicode, and in UTF-8 there's no overhead (if UTF-8 is the internal
representation, as in gtk) or little overhead (if UTF-16 is used
internally, as in Qt) for the codeset conversion. Please don't
extrapolate from just a couple of bad examples.

> At the present time, using Unicode on Unix is an act of faith.

  Well, I thought this was 2003. You wrote as if it were 2000. You sound
like a one-time 'convert' who lost the faith a long time ago and has
never come back to see how much has changed since.

Moreover, given that in the above sentence you used 'Unix' instead of
Linux, Sun and IBM engineers who worked on UTF-8 locale support on Solaris
and AIX may take offense at your remark. I can't say much about AIX except
that it has supported UTF-8 locales for as long as Solaris has. As for
Solaris, Solaris 7 (released in the late 1990s) and onward don't even have
some remaining problems Linux still has (i.e. grep/sed/ls/sort and other
textutils not honoring the locale in their handling of text streams).


>> Default charset for recent versions of some popular distributions.
>
>
> Yes, RedHat changed the default charset to Unicode without thinking
> that text files were no longer readable.

  Unreadable? What is iconv(1) for? Perhaps RH should have included a
nice GUI migration tool (as a part of the RH 8/9 installation disk) to
let clueless end users (Mom and Pop) convert all their text files in
legacy encodings to UTF-8, along with a similar tool for the filename
conversion.

   I'm not saying that using Unicode (mostly in the form of UTF-8) on
Linux is as seamless as I wish it to be (there are a number of issues
I wan

Merging combining classes, was: New contribution N2676

2003-10-25 Thread Peter Kirk
On 25/10/2003 04:11, Philippe Verdy wrote:

> From: "Peter Kirk" <[EMAIL PROTECTED]>
>
>> Have combining classes actually been defined for these characters?
>>
>> This is of course exactly the same problem as with Hebrew vowel points
>> and accents, except that this time it applies to real living languages.
>> Perhaps it is time to do something about these combining classes which
>> conflict with the standard.
>
> Do you mean officially documenting the correct (and strict) use of CGJ as
> the only way to bypass the default order required by the combining classes
> in normalized forms? It would be a good idea to document officially which
> use of CGJ is superfluous and should be avoided in NF forms, and which use
> is required.

This isn't what I meant, but I agree that some such definition would be 
a good idea.

What I had in mind was a probably hopeless plea for the wrongly assigned 
combining classes to be corrected. After all, the current assignments 
manifestly breach the standard, because marks with different classes 
interact typographically.

I wonder if it would in fact be possible to merge certain adjacent 
combining classes, as from a future numbered version N of the standard. 
That would not affect the normalisation of existing text; text 
normalised before version N would remain normalised in version N and 
later, although not vice versa. I know that this would break the letter 
of the current stability policy, but is this kind of backward 
compatibility actually necessary? The change could be sold to others as 
required for the internal consistency of Unicode.

If this were possible, the Hebrew and Arabic problem could be partly 
solved, in a non-optimal way but one which is less messy than the 
current situation. The idea would be for all Hebrew marks (i.e. all 
combining marks in 05B0-05C2) to be merged into one combining class, and 
similarly all Arabic harakat etc. including the new Arabic tone signs. 
This would make significant the relative orderings of multiple vowels 
(and meteg), and avoid the need for CGJ hacks. It would also allow the 
logical order of shadda, dagesh and sin and shin dots to be the 
canonical one, with significant advantages for collation etc as well as 
for rendering.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: New contribution N2676

2003-10-25 Thread Philippe Verdy
From: "Peter Kirk" <[EMAIL PROTECTED]>
> Have combining classes actually been defined for these characters?
>
> This is of course exactly the same problem as with Hebrew vowel points
> and accents, except that this time it applies to real living languages.
> Perhaps it is time to do something about these combining classes which
> conflict with the standard.

Do you mean officially documenting the correct (and strict) use of CGJ as
the only way to bypass the default order required by the combining classes
in normalized forms? It would be a good idea to document officially which
use of CGJ is superfluous and should be avoided in NF forms, and which use
is required.

1) This will affect only the input methods for those languages that need to
"swap" the standard order of combining characters to keep their logical
order (all this will require is an additional input control that will try
swapping ambiguous orders).

2) A complete documentation may need to specify which pairs of combining
characters are affected (this should list the pairs of combining characters
<c1, c2> where CC(c1) > CC(c2) and that require to be encoded <c1, CGJ, c2>
to be kept in logical order, as the sequence <c1, c2> will be reordered into
<c2, c1> in normalized forms).

3) The other issue would be that there may exist other combining characters
than those in this pair.
Suppose I want to represent <c1, c2, c3>, where CC(c1) > CC(c2), but
c3 does not have a conflicting pair in the previous list. Should it be
encoded as <c1, CGJ, c2, c3> or as <c1, c3, CGJ, c2>? As the
standard normalization algorithm cannot be changed, both sequences will be
possible with the NF forms, even though they represent the same character.

One could design an extra normalization step to force one interpretation (so
that only combining characters with conflicting combining classes that have
been forced "swapped" will appear after CGJ, all other diacritics being
encoded preferably in the first sequence before the CGJ).

This extra step should not be part of the NF forms (because Unicode states
that normalized forms will be kept normalized in all further versions of
Unicode), but it could be named differently, by describing a system in
which extra normalization steps may be applied that may change NF forms into
other "equivalent" sequences also in normalized form.




Re: U+0BA3, U+0BA9

2003-10-25 Thread Peter Jacobi
Hi Kenneth, All,

Thank you for the quick clarification of matters.

Kenneth Whistler <[EMAIL PROTECTED]> wrote:
> U+0BA3 TAMIL LETTER NNA is the retroflex n, usually transliterated
> as n-underdot .

which is N UofKöln transliteration, I assume.
 
> U+0BA9 TAMIL LETTER NNNA is the distinct alveolar n, usually
> transliterated as n-macronbelow .

which is n2 UofKöln transliteration, I assume.

> The 10646 naming conventions, which are stuck with A-Z for
> transliteration, generally use doubled letters to indicate
> retroflex consonants, particular for Indic languages. When
> a third distinction needs to be made, as for Tamil, the
> third name occasionally just gets a tripled letter, as is
> the case for U+0BA9.

So, in effect the UNICODE character names attempt to be
a unified transliteration scheme for all languages? Are these
principles laid down somewhere or is this more informal?

> TSCII naming conventions may differ.

I assume the TSCII authors got the UNICODE names mixed up, as
Tamil was not short of differing transliteration schemes even
before the UNICODE one appeared.

Regards,
Peter Jacobi





Re: New contribution N2676

2003-10-25 Thread Peter Kirk
On 24/10/2003 18:09, Kenneth Whistler wrote:

> ...
>
> Incidentally, the characters U+065A..U+065C are all tonal
> diacritics for African languages written in the Arabic script.
> They should not be confused with the similar shaped diacritics
> which are part of the extended letters of Arabic. The tones can be
> stacked on Arabic letters which already have letter diacritics
> as part of their shapes.

Are they also potentially stacked with Arabic vowel signs (harakat)? If 
so, they interact with them typographically. And the standard specifies 
that they should therefore have the same combining classes as the 
harakat. The problem is, the harakat which appear in the same position 
have different combining classes. And if x<>y, there is no z such that 
z=x and z=y. So it is impossible to define these new characters in a way 
which does not conflict with the standard.

Have combining classes actually been defined for these characters?

This is of course exactly the same problem as with Hebrew vowel points 
and accents, except that this time it applies to real living languages. 
Perhaps it is time to do something about these combining classes which 
conflict with the standard.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/