Arabic Presentation Forms

2003-01-30 Thread Mete Kural
Hello Unicode-List,

I need to figure out a method to convert Arabic
Unicode text encoded in its normal form to Arabic
Unicode text encoded in Arabic presentation forms. So
the Unicode of each character having an Arabic
encoding (ex: 0622) would be converted to the
equivalent Arabic presentation form encoding
(ex: Fxxx). Thus if the character was a letter "Lam" in
the middle of a word, it would be encoded with the
corresponding presentation form encoding.

Do you have any suggestions on how I could convert a piece
of Unicode text in this manner? Are there any programs
that could do this?

Thank you very much for the information.

Mete Kural
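
The conversion asked about above comes down to picking, for each letter, the isolated, final, initial or medial form according to whether its neighbours join to it. What follows is only a minimal sketch under stated assumptions: the class name, the `shape` method and the three-letter table (ALEF, BEH, LAM) are hypothetical and chosen for illustration. A usable converter would need the full Arabic Presentation Forms-B table, the lam-alef ligatures and handling of combining marks, none of which is attempted here. Libraries such as ICU offer Arabic shaping functions that cover those cases.

```java
import java.util.HashMap;
import java.util.Map;

public class PresentationFormsSketch {
    // Illustrative table only. Each entry: {isolated, final, initial, medial};
    // 0 means the letter has no such form (it does not join to the next letter).
    private static final Map<Character, char[]> FORMS = new HashMap<>();
    static {
        FORMS.put('\u0627', new char[] {'\uFE8D', '\uFE8E', 0, 0});               // ALEF
        FORMS.put('\u0628', new char[] {'\uFE8F', '\uFE90', '\uFE91', '\uFE92'}); // BEH
        FORMS.put('\u0644', new char[] {'\uFEDD', '\uFEDE', '\uFEDF', '\uFEE0'}); // LAM
    }

    // True when c is a known letter that connects to the letter after it.
    private static boolean joinsToNext(char c) {
        char[] forms = FORMS.get(c);
        return forms != null && forms[2] != 0;
    }

    // True when c is a known letter; every letter in the table connects to a joining predecessor.
    private static boolean joinsToPrevious(char c) {
        return FORMS.containsKey(c);
    }

    /** Replaces each known letter with its contextual presentation form. */
    public static String shape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char[] forms = FORMS.get(c);
            if (forms == null) {        // not in the table: copy through unchanged
                out.append(c);
                continue;
            }
            boolean joinedBefore = i > 0 && joinsToNext(text.charAt(i - 1));
            boolean joinedAfter = forms[2] != 0
                    && i + 1 < text.length() && joinsToPrevious(text.charAt(i + 1));
            if (joinedBefore && joinedAfter) {
                out.append(forms[3]);   // medial
            } else if (joinedBefore) {
                out.append(forms[1]);   // final
            } else if (joinedAfter) {
                out.append(forms[2]);   // initial
            } else {
                out.append(forms[0]);   // isolated
            }
        }
        return out.toString();
    }
}
```

With this table, `shape("\u0628\u0644\u0628")` (BEH LAM BEH) yields initial BEH, medial LAM and final BEH, which is the "Lam in the middle of a word" case from the question.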





Re: vietnamese font

2003-01-30 Thread Paul Hastings
> 

correct.

> Plain old Arial actually isn't your best choice, because it displays the

actually i meant ms arial unicode. it's handy but huge.

> Fonts on my Windows 2000 system (at work) that support the Vietnamese
> precomposed characters *and* display these combinations correctly
> include:

good to know. thanks.

> settle for a font that only supports the "Windows Vietnamese" (CP1258)
> set.  And do *not* let yourself get involved with VNI fonts.

ah more good advice, thanks again.





Re: vietnamese font

2003-01-30 Thread Doug Ewell
Paul Hastings wrote:

> besides ms arial would anybody like to recommend a font suitable for
> vietnamese?



Plain old Arial actually isn't your best choice, because it displays the
circumflex-plus-grave and circumflex-plus-acute combinations
incorrectly.  For Vietnamese they're supposed to be side-by-side, not
one stacked on top of the other.  Courier New and Times New Roman have
the same problem.

Fonts on my Windows 2000 system (at work) that support the Vietnamese
precomposed characters *and* display these combinations correctly
include:

- Arial Unicode MS
- Code2000
- Gentium and GentiumAlt
- Microsoft Sans Serif (not "MS Sans Serif")
- Palatino Linotype (not "Palatino")
- Tahoma
- TITUS Cyberbit Basic
- Verdana
- Verdana Ref

Palatino Linotype is interesting: it displays the grave tone mark to the
right of the circumflex (^`), unlike the others (`^), but according to
some Vietnamese the first method is preferable.

On Verdana Ref, the vertical spacing is a bit too tight for comfort for
stacked combinations like circumflex-plus-tilde, but you may be able to
adjust this.

For best typography, you want to make sure your font supports the
Vietnamese precomposed characters in the U+1Exx range.  The decomposed
combinations are canonically equivalent, of course, but they may not
display as nicely on currently available rendering engines.  Don't
settle for a font that only supports the "Windows Vietnamese" (CP1258)
set.  And do *not* let yourself get involved with VNI fonts.



-Doug Ewell
 Fullerton, California
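
The canonical equivalence mentioned above can be checked with the standard `java.text.Normalizer` class. This is a small sketch, not part of the original message; the class and method names are made up for illustration. It shows that the precomposed U+1EA7 (a with circumflex and grave) and the decomposed sequence a + U+0302 + U+0300 normalize to each other, so text can be converted to the precomposed form before it is handed to a renderer that handles the U+1Exx characters better.

```java
import java.text.Normalizer;

public class VietnameseNormalization {
    /** Composes base letters and combining marks into precomposed characters (NFC). */
    public static String toPrecomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Splits precomposed characters into base letters plus combining marks (NFD). */
    public static String toDecomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u1EA7";        // LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
        String decomposed = "a\u0302\u0300";  // a + COMBINING CIRCUMFLEX + COMBINING GRAVE
        System.out.println(toDecomposed(precomposed).equals(decomposed));  // true
        System.out.println(toPrecomposed(decomposed).equals(precomposed)); // true
    }
}
```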





Re: urban legends just won't go away!

2003-01-30 Thread Kenneth Whistler

> This is a simple example demonstrating my own personal method.
> 

// to upper case
  public char upper(int c)
  {
    return (char)((c >= 97 && c <= 122) ? VisitSewers(c) : c);
  }

  static int VisitSewers(int c)
  {
    return AlligatorByte(c);
  }

  static int AlligatorByte(int c)
  {
    // Remove SPACE from character and return mangled remnant.
    return (c - 0x20);
  }
  
  





Fwd: News: W3C home page genuinely served as UTF-8

2003-01-30 Thread Asmus Freytag
This came in recently:


From: Martin Duerst <[EMAIL PROTECTED]>
Subject: News: W3C home page genuinely served as UTF-8

This is just a very small news item that I wanted to share:
(probably too little too late, but a step in the right
direction anyway)

As of a few minutes ago, the W3C home page at http://www.w3.org
is finally served genuinely as UTF-8. It was already served
with UTF-8 as the charset/encoding for a while, but numeric
character references were used for the few non-US-ASCII
characters (copyright, registered) in the page, so that
the charset/encoding was not particularly relevant.
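
To illustrate the point about numeric character references: a page that writes the copyright sign as `&#169;` contains only US-ASCII bytes, so its declared encoding hardly matters, whereas the literal character takes two bytes in UTF-8. The sketch below is an illustration only; the class and method names are hypothetical and the sample strings are not taken from the W3C page.

```java
import java.nio.charset.StandardCharsets;

public class Utf8VersusNcr {
    /** True when every UTF-16 code unit of s is in the US-ASCII range. */
    public static boolean isPureAscii(String s) {
        return s.chars().allMatch(ch -> ch < 0x80);
    }

    /** Space-separated uppercase hex dump of the UTF-8 encoding of s. */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withNcr = "&#169; W3C";  // copyright sign as a numeric character reference
        String literal = "\u00A9 W3C";  // copyright sign as a literal character
        System.out.println(isPureAscii(withNcr)); // true
        System.out.println(isPureAscii(literal)); // false
        System.out.println(utf8Hex("\u00A9"));    // C2 A9
    }
}
```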

Regards,    Martin.






Re: Indic Devanagari Query

2003-01-30 Thread Anto'nio Martins-Tuva'lkin
On 2003.01.29, 05:52, Aditya Gokhale <[EMAIL PROTECTED]> wrote:

> 1. In Marathi and Sanskrit language two characters glyphs of 'la' and
> 'sha' are represented differently as shown in the image below -
>
> (First glyph is 'la' and second one is 'sha')
>
> as compared to Hindi where these character glyphs are represented as
> shown in the image below -
>
> (First glyph is 'la' and second one is 'sha')

Not very different from the Serbian vs. Russian rendition of Cyrillic
lower case "i" in italics. There are more examples, though (almost?)
none in the Latin script.

-- 
António MARTINS-Tuválkin <[EMAIL PROTECTED]>
R. Laureano de Oliveira, 64 r/c esq.
PT-1885-050 MOSCAVIDE (LRS)
+351 917 511 459
http://www.tuvalkin.web.pt/bandeira/
http://pagina.de/bandeiras/

Não me invejo de quem tem carros, parelhas e montes
só me invejo de quem bebe a água em todas as fontes





Re[2]: ISO 639 arg - Esperanto

2003-01-30 Thread Anto'nio Martins-Tuva'lkin
On 2003.01.22, 17:04, Markus Scherer <[EMAIL PROTECTED]> wrote:

> If only Ferran used "real" characters (car) instead of
> transliterations (cxar) - oddly intermixed with using
> "real" é in "aragonés" :-}
>
> Sorry for being off-topic.

Not too off-topic, IMHO, as it concerns the still incomplete penetration
of UTF encoding in e-mail practices. The message quoted used, as usual
in Esperanto, a 7-bit surrogate along with cp1252. It is still very rare
to see properly presented Esperanto in e-mails (unlike, e.g., on web pages).

Replacing U+0302 with U+0078 (seldom others) is, BTW, not a
transliteration, but a surrogate. AFAIK first used in the 1920s for
Esperanto telegraphy.
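
The x-surrogate described above is mechanical enough to undo in code. The following is a naive sketch, assuming that every c, g, h, j, s or u followed by x stands for the accented letter; the class and method names are hypothetical, and no attempt is made to handle words where such a pair is meant literally.

```java
public class XSystem {
    // Base letters that take a diacritic in Esperanto, and their accented counterparts
    // in the same order: c-, g-, h-, j-, s-circumflex and u-breve, lower case then upper case.
    private static final String BASES = "cghjsuCGHJSU";
    private static final String ACCENTED =
            "\u0109\u011D\u0125\u0135\u015D\u016D\u0108\u011C\u0124\u0134\u015C\u016C";

    /** Converts x-surrogate text such as "cxar" into text with the real accented letters. */
    public static String toUnicode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int index = BASES.indexOf(c);
            boolean followedByX = i + 1 < s.length()
                    && (s.charAt(i + 1) == 'x' || s.charAt(i + 1) == 'X');
            if (index >= 0 && followedByX) {
                out.append(ACCENTED.charAt(index));
                i++; // consume the surrogate x as well
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, `toUnicode("cxar")` returns "\u0109ar", the word Markus Scherer wished had been written with real characters.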






vietnamese font

2003-01-30 Thread Paul Hastings
besides ms arial would anybody like to recommend a font suitable for
vietnamese?
thanks.
--
Paul Hastings [EMAIL PROTECTED]
Member Team Macromedia (ColdFusion)






RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread jameskass
.
Kent Karlsson wrote,

> > I add that this is a good way of displaying a combining mark that has no
> > base character, i.e. one occurring at the begin of a line or paragraph.
> 
> No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121).

So it says.  But the 'space method' could be interpreted as falling under
the "Simple Overlap" fallback rendering method, since the paragraph from
which you are quoting appears immediately after the paragraph dealing
with the "Simple Overlap" method.

In the first paragraph under "Fallback Rendering" (bottom of page 120), 
the "Show Hidden" method is described.  It would have been redundant to 
say that degenerate cases **under the "Show Hidden" method** need to
be displayed as if on a dotted circle, since the "Show Hidden" method
is **all about displaying on dotted circles** whenever there is an
inability to draw the sequence.  Like, in the event of an invalid
sequence.

Whether to use the "Show Hidden" or the "Simple Overlap" method to display
invalid or degenerate sequences should be left up to the various
software developers.

It does seem to be very easy to spot bad input when the dotted circle
appears in the display.  Stands out like a sore thumb, which was
probably the intention.

Perhaps this section of T.U.S. could stand some clarification.

Best regards,

James Kass
.




RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread John Hudson
At 01:20 AM 1/30/2003, Marco Cimarosti wrote:


> However, I totally agree with Kent that this funny rendering is *not* a
> requirement of the Unicode standard, as Keyur Shroff seems to suggest. It is
> just an example of many "several methods [that] are available to deal with"
> strange sequences.


Perhaps there is some confusion here because the use of the dotted circle 
is an explicit recommendation of the MS Indic OpenType spec, which is what 
the majority of Indian font developers are now working with. So there may 
be some confusion between what is expected in the MS spec and what is 
required by Unicode.

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

A book is a visitor whose visits may be rare,
or frequent, or so continual that it haunts you
like your shadow and becomes a part of you.
   - al-Jahiz, The Book of Animals




RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Marco Cimarosti
Keyur Shroff wrote:
> > However, I totally agree with Kent that this funny rendering is *not* a
> > requirement of the Unicode standard, as Keyur Shroff seems to suggest. It
> > is just an example of many "several methods [that] are available to deal
> > with" strange sequences.
> 
> A sequence should not be treated as "strange" sequence if it has been
> written intentionally. It may have some contextual meaning.

I said "strange" in the sense of character sequences that are not part of
the ordinary spelling of any language. In fact, a thing like a matra
floating in the air or on a dotted circle is something that you'd only see
in a text (not necessarily *in* an Indian language) which talks about
spelling, character sets, and the like.

> Also, what is good or bad is also subjective. It may also vary from one
> script to another.

Yes, but what is mandatory and what is not in Unicode should not be too
subjective, else we could not call it a "standard".

_ Marco




RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Kent Karlsson

> Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947]
> appearing after any non-consonant character. This sign is generally
> attached to the consonants. It has zero advance width with negative left
> side bearing in the font. 

Ok.

> Clearly, since in this case the sign is not
> preceded by any consonant base, it has to be rendered using one of the
> mechanisms specified in fallback rendering of non-spacing marks.

If it is preceded by a SPACE (or is first in a string/paragraph/similar)
it should be rendered as a "freestanding" glyph (no dotted circle).  If it
is preceded, in the source string, by, say, FULL STOP, a typographically
acceptable rendering would be to have the vowel sign E glyph float on
top of the glyph for the FULL STOP (no dotted circle).  Similarly for a
vowel sign E that follows a LATIN CAPITAL LETTER A. (But I don't expect
good positioning, just readable.) Again similarly, a vowel sign I that
follows an EQUAL SIGN should be rendered as a vowel sign I glyph to the
left of an EQUAL SIGN glyph.  No dotted circle. (I know that the reordrant
vowel signs may reorder over more than the preceding base character IF it
is a (sub)string in an Indic script.) Again similarly, a <KA, II, II>
string should be rendered as a KA + II + II glyph sequence (invoking
any ligature for KA + II if there is one in the font; II + II is
unlikely to have any ligature, since it is not used by any orthography). 
No dotted circle(s). The fallback hinted at in TUS 3.0 that uses dotted
circles is 1) typographically horrible, and 2) cannot indicate that
there is any error in the given character sequence.

...
> the application. Now in order to render it with dotted circle if we
> introduce the circle in the text before this sign then also 
> the circle is invalid base for this "Vowel Sign E".

No base character is invalid for any combining character.

...
> > Languages or syllable boundaries have nothing to do with this. These
> > special sequences should *never* be part of any syllable or word in any
> > language: they are just a way of showing the shape of a glyph, to be used
> > when, e.g., talking about typography or spelling.
> 
> Then how can we take care of fallback mechanism?

By removing that particular fallback mechanism from implementations
as well as the TUS text!  (I'm serious!) This particular fallback
mechanism is NOT recommended as it stands.  But since its mention is
erroneously taken as a recommendation, I'd suggest removing also its
mention.  That mechanism is as bad as misplacing glyphs for combining
marks on the glyph(s) for the follow-on character, if not worse.
("Show invisibles" (for all of the text or a "user" selected run
of the text) is an entirely different story.)

/Kent K
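
The two fallbacks argued over in this thread differ only in which placeholder the renderer supplies for a combining mark that has no base. Below is a rough sketch of that choice, assuming the placeholder is added to a display copy of the text and never to the backing store. The class, constants and method names are hypothetical, and only a mark at the very start of the text is handled.

```java
public class IsolatedMarkFallback {
    public static final char SPACE = '\u0020';         // the "as if preceded by a SPACE" reading
    public static final char DOTTED_CIRCLE = '\u25CC'; // the "Show Hidden" style placeholder

    /** True for code points in general categories Mn, Mc or Me. */
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /**
     * Returns a display string: if the text begins with a combining mark,
     * the chosen placeholder is put in front of it; otherwise the text is unchanged.
     */
    public static String forDisplay(String text, char placeholder) {
        if (text.isEmpty() || !isCombiningMark(text.codePointAt(0))) {
            return text;
        }
        return placeholder + text;
    }
}
```

Passing `SPACE` gives the free-standing glyph that Kent describes; passing `DOTTED_CIRCLE` gives the behaviour he objects to. Which one an implementation should default to is exactly what the thread disputes.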





RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Kent Karlsson

> > I don't know where you find support for that position in that text.
> > Can you please quote?  There are no "invalid base consonants" for
> > any dependent vowel (for Indic scripts; similarly for any other script).
> 
> Actually, there is a mention of displaying combining marks on dotted
> circles:

I know.  But there is no mention (that I have found) of "invalid base
characters" or any recommendation for using dotted circles especially
for Indic scripts.

> I add that this is a good way of displaying a combining mark that has no
> base character, i.e. one occurring at the begin of a line or paragraph.

No, those should be displayed *as if* preceded by a SPACE (TUS 3.0 page 121).

/Kent K





RE: urban legends just won't go away!

2003-01-30 Thread Carl W. Brown
Barry,

If you think that this is bad try 390 mainframe EBCDIC shift to upper case.
You can shift up to 256 characters at a time with a single machine language
instruction by ORing a line of spaces to your character field.  Now that is
bit flipping and is still heavily used.

Carl
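
For readers who have not met the trick: in EBCDIC the space is 0x40, lower case 'a' is 0x81 and upper case 'A' is 0xC1, so ORing a space onto a lower case letter sets the one bit that separates the two cases. The sketch below imitates that in Java on a byte array. The class and method names are hypothetical, and it says nothing about how the mainframe instruction itself is coded.

```java
public class EbcdicUpper {
    static final int EBCDIC_SPACE = 0x40;

    /** ORs an EBCDIC space onto every byte of the field, as the mainframe idiom does. */
    public static byte[] orWithSpaces(byte[] field) {
        byte[] out = new byte[field.length];
        for (int i = 0; i < field.length; i++) {
            out[i] = (byte) (field[i] | EBCDIC_SPACE);
        }
        return out;
    }

    public static void main(String[] args) {
        // EBCDIC bytes for "abC1 ": a=0x81, b=0x82, C=0xC3, 1=0xF1, space=0x40
        byte[] field = {(byte) 0x81, (byte) 0x82, (byte) 0xC3, (byte) 0xF1, 0x40};
        for (byte b : orWithSpaces(field)) {
            System.out.printf("%02X ", b & 0xFF); // C1 C2 C3 F1 40
        }
        System.out.println();
    }
}
```

The catch is the same as with the ASCII version: every byte with the 0x40 bit clear is altered, whether or not it is a letter, so the idiom only works on fields known to hold nothing else.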


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Barry Caplan
> Sent: Wednesday, January 29, 2003 10:01 PM
> To: [EMAIL PROTECTED]
> Subject: urban legends just won't go away!
>
>
> http://archive.devx.com/free/tips/tipview.asp?content_id=4151
>
> Who knew in this day and age flipping bits to change case is
> still publishable (this is from today!)
>
> Barry Caplan
> www.i18n.com
> Vendor Showcase: http://Showcase.i18n.com
>
>
> --
>
> Use Logical Bit Operations to Changing Character Case
>
>
> This is a simple example demonstrating my own personal method.
>
> // to lower case
>   public char lower(int c)
>   {
>     return (char)((c >= 65 && c <= 90) ? c |= 0x20 : c);
>   }
>
> // to upper case
>   public char upper(int c)
>   {
>     return (char)((c >= 97 && c <= 122) ? c ^= 0x20 : c);
>   }
> /*
>  If I would I could create a method for converting an entire
>  string to lower, like this:
> */
>   public String getLowerString(String s)
>   {
>     char[] c = s.toCharArray();
>     char[] cres = new char[s.length()];
>     for(int i=0;i<c.length;i++)
>       cres[i] = lower(c[i]);
>     return String.valueOf(cres);
>   }
> /*
> even converting in capital:
> */
>   public String capital(String s)
>   {
>     return String.valueOf(upper(s.toCharArray()[0])).concat(s.substring(1));
>   }
> /* using it*/
>   public static void main(String args[])
>   {
>     x xx = new x();
>     System.out.println(xx.getLowerString("LOWER: " + "FRAME"));
>     System.out.println(xx.upper('f'));
>     System.out.println(xx.capital("randomaccessfile"));
>   }
>
>
>
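
Why the list treats the quoted tip as a legend can be seen by setting it beside the case mapping that Java already ships. This is a sketch for contrast only: `asciiUpper` is a hypothetical restatement of the tip's bit flip, while `Character.toUpperCase` and `String.toUpperCase(Locale)` are the standard library calls.

```java
import java.util.Locale;

public class CaseMappingContrast {
    /** The tip's approach restated: flip bit 0x20, but only for ASCII a-z. */
    public static char asciiUpper(char c) {
        return (c >= 'a' && c <= 'z') ? (char) (c ^ 0x20) : c;
    }

    public static void main(String[] args) {
        // ASCII works as advertised.
        System.out.println(asciiUpper('f'));                                // F
        // Anything outside a-z is silently left alone: e with acute stays lower case.
        System.out.println(asciiUpper('\u00E9') == '\u00E9');               // true
        // The library maps it to E with acute.
        System.out.println(Character.toUpperCase('\u00E9') == '\u00C9');    // true
        // Some mappings change the length of the string (sharp s becomes SS).
        System.out.println("stra\u00DFe".toUpperCase(Locale.ROOT));         // STRASSE
        // Some depend on the language: Turkish i upper-cases to dotted capital I, U+0130.
        System.out.println("i".toUpperCase(Locale.forLanguageTag("tr")).equals("\u0130")); // true
    }
}
```

None of the last three cases can be expressed as a single-bit operation on one code unit, which is why per-character bit tricks do not generalize beyond ASCII.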






RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Keyur Shroff

--- Marco Cimarosti <[EMAIL PROTECTED]> wrote:

> 
> I add that this is a good way of displaying a combining mark that has no
> base character, i.e. one occurring at the begin of a line or paragraph.
> 
> However, I totally agree with Kent that this funny rendering is *not* a
> requirement of the Unicode standard, as Keyur Shroff seems to suggest. It
> is just an example of many "several methods [that] are available to deal
> with" strange sequences.

A sequence should not be treated as "strange" sequence if it has been
written intentionally. It may have some contextual meaning.

> 
> > Any combining characters can be placed on any base characters without
> > there being any dotted circles displayed.

Not only that, but it is also desirable. How can one write a vowel matra
both with and without a dotted circle in a single document if Unicode
recommends placing it only on top of a space character? A matra with a dotted
circle is sometimes useful, as in the case of printing or explaining the
Unicode standard. A user may want to hide the dotted circle in the same
document in order to explain the actual shape of the matra character, i.e.,
without the dotted circle. Both kinds of rendering behaviour are possible.
There should be some mechanism to turn the dotted circle on or off, depending
on the default behaviour.

Also, what is good or bad is also subjective. It may also vary from one
script to another.

- Keyur







RE: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Marco Cimarosti
Kent Karlsson wrote:
> Keyur Shroff wrote
> [...]
> > In Indic scripts any sign that appears in text not in conjunction with a
> > valid consonant base may be rendered with dotted circle as fallback
> > mechanism (Section 5.14 "Rendering Nonspacing Marks"
> > http://www.unicode.org/uni2book/ch05.pdf).
> 
> I don't know where you find support for that position in that text.
> Can you please quote?  There are no "invalid base consonants" for
> any dependent vowel (for Indic scripts; similarly for any other script).

Actually, there is a mention of displaying combining marks on dotted
circles:

"Several methods are available to deal with an unknown composed
character sequence that is outside of a fixed, renderable set [...]. One
method (Show Hidden) indicates the inability to draw the sequence by drawing
the base character first and then rendering the nonspacing mark as an
individual unit - with the nonspacing mark positioned on a dotted circle."
(The Unicode Standard 3.0, page 120 - 5.14 Rendering Nonspacing Marks -
Fallback Rendering)

I add that this is a good way of displaying a combining mark that has no
base character, i.e. one occurring at the begin of a line or paragraph.

However, I totally agree with Kent that this funny rendering is *not* a
requirement of the Unicode standard, as Keyur Shroff seems to suggest. It is
just an example of many "several methods [that] are available to deal with"
strange sequences.

> > Any system implementing this as
> > default behaviour should not be considered buggy.
> 
> Indeed they are.  And it should certainly not be default behaviour.

In this case, I disagree with Kent: displaying these dotted circles is not
mandatory, but certainly not a bug.

> Any combining characters can be placed on any base characters without
> there being any dotted circles displayed.

True. But notice that Kent (against his own opinion) correctly wrote "can",
not "must".

> [...]

_ Marco




Re: Suggestions in Unicode Indic FAQ

2003-01-30 Thread Aditya Gokhale



 
To support what Keyur has to say, I will add a few more things.

Take for instance a "vowel sign" (matra, as we call it here in India), say
e (U+093F, which Unicode names VOWEL SIGN I). When it is combined with a
consonant like ka (U+0915) in that sequence, it forms ke. (Please see the
first image.) The repositioning of the shape happens automatically. If anyone
puts the e matra (U+093F) first and then the consonant ka (U+0915), this
should be highlighted; putting in a space can still make them look like ka.
So to make this mistake very explicit, we have to put a dotted circle. If a
wrong combination is stored, it will create a lot of problems in searching
the data as well as in sorting. Please refer to the BIS (Bureau of Indian
Standards) ISCII 91 / 88 documentation for this where in

 
 
-Aditya
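
The ordering mistake described above can be detected in stored text before it causes trouble in searching or sorting. What follows is a simplified sketch, assuming that only Devanagari is checked, that consonants are U+0915 through U+0939 and that dependent vowel signs are U+093E through U+094C. The class and method names are hypothetical, and whether such sequences should be flagged at all is the very point debated in this thread.

```java
public class MatraOrderCheck {
    static final char NUKTA = '\u093C';

    // Simplified ranges; the precomposed nukta consonants U+0958 to U+095F are ignored.
    static boolean isConsonant(char c) {
        return c >= '\u0915' && c <= '\u0939';
    }

    static boolean isDependentVowelSign(char c) {
        return c >= '\u093E' && c <= '\u094C';
    }

    /**
     * Returns the index of the first vowel sign that is not directly preceded
     * by a consonant (optionally followed by a nukta), or -1 if there is none.
     */
    public static int firstMisplacedMatra(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isDependentVowelSign(s.charAt(i))) {
                continue;
            }
            int j = i - 1;
            if (j >= 0 && s.charAt(j) == NUKTA) {
                j--; // step over a nukta to reach the consonant
            }
            if (j < 0 || !isConsonant(s.charAt(j))) {
                return i;
            }
        }
        return -1;
    }
}
```

With this check, ka followed by U+093F passes, while U+093F followed by ka is reported at index 0. An application could highlight the reported position in its own way without changing what the renderer draws.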
 
 
----- Original Message -----
From: "Keyur Shroff" <[EMAIL PROTECTED]>
To: "'Unicode Mailing List'" <[EMAIL PROTECTED]>

> --- Marco Cimarosti <[EMAIL PROTECTED]> wrote:
> > Keyur Shroff wrote:
> > > But sometimes a user may want visual representation of these
> > > symbols in two different ways: with dotted circle and
> > > without dotted circle.
> >
> > Why not using a dotted circle character explicitly, when you want to see
> > one?
>
> Note that whenever I mention the word "combining mark" I am really talking
> about "vowel signs (matras)" and other modifiers in Indic scripts which is
> script dependent. I am sorry if I have confused you with the combining
> diacritical marks in the block [U+0300-U+036F] which I really didn't mean.
>
> Let me give a proper example this time. Consider a "Vowel Sign E" [U+0947]
> appearing after any non-consonant character. This sign is generally
> attached to the consonants. It has zero advance width with negative left
> side bearing in the font. Clearly, since in this case the sign is not
> preceded by any consonant base, it has to be rendered using one of the
> mechanisms specified in fallback rendering of non-spacing marks. If we
> render it with space, as you said, then we have to insert "space" character
> at the time of fallback rendering (which can be taken care in rendering
> pipeline) even though space character is not present in backing store of
> the application. Now in order to render it with dotted circle if we
> introduce the circle in the text before this sign then also the circle is
> invalid base for this "Vowel Sign E". As a result, again fallback rendering
> will take place with rendering circle and the vowel sign positionally
> separate. In this case first dotted circle will appear which will be
> followed by vowel sign (matra) on top of space character.
>
> If you know any other way to solve this problem then please explain. Also
> let me know if I have misinterpreted the text written in Unicode standard.
>
> > > Example of
> > > this could be RAsup on top of dotted circle and RAsup on top of space
> > > character. Current use of space character to eliminate dotted
> > > circle is really painful and may create problems in determining
> > > language and syllable boundaries.
> >
> > Languages or syllable boundaries have nothing to do with this. These
> > special sequences should *never* be part of any syllable or word in any
> > language: they are just a way of showing the shape of a glyph, to be used
> > when, e.g., talking about typography or spelling.
>
> Then how can we take care of fallback mechanism?
>
> Thanks for taking pain for answering my queries :-)
>
> - Keyur