Re: Suggestions in Unicode Indic FAQ

2003-02-05 Thread Doug Ewell
Kent Karlsson kentk at md dot chalmers dot se wrote:

 Consider English.  If I write , that may well be a spell error.

Or even Ŋŋŋŋ!, as Michael Everson wrote in WG2 N2306.

-Doug Ewell
 Fullerton, California





Re: Indic Devanagari Query

2003-02-05 Thread Andrew C. West
On Wed, 05 Feb 2003 02:00:30 -0800 (PST), [EMAIL PROTECTED] wrote:

 If these alternate forms were needed to be displayed in a single
 multi-lingual plain-text file, wouldn't we need some method of 
 tagging the runs of Latin text for their specific languages?

Is this not what the variation selectors are available for ?

And now that we are soon to have 256 of them, perhaps Unicode ought not to be shy
about using them for characters other than mathematical symbols.

Andrew




Re: How to convert special characters into unicode?

2003-02-05 Thread Chris Jacobs

- Original Message - 
From: SRIDHARAN Aravind [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, February 05, 2003 8:27 AM
Subject: How to convert special characters into unicode?


 How to get unicode values for special characters in Java?
 I have a set of Czech special characters?
 
 For LATIN CAPITAL LETTER C WITH CARON, the unicode is 010c and 010d ( for both upper 
and lower cases).
 And I got this value from a PDF chart(u0100.odf) in www.unicode.org
 
 I have Czech special characters in an excel file.
 I copy them into Notepad.

Try copying to WordPad instead.

 I save them.

In WordPad you can save as Unicode.

 Now I use native2ascii convertor that is available with JDK.

Why? You don't want to convert to ASCII.

 After I run this utility, I am getting some other unicode values or sometimes only 
whitespaces come out.
 I don't know why?
 
 Please let me know how I convert special characters into unicode.
 Thank you.
 
 Aravind.
 





Re: list etiquette (was Re: Tailoring of normalization

2003-02-05 Thread Lars Marius Garshol

* [EMAIL PROTECTED]
| 
| Please forgive me and others who are on similar set-ups if this is
| all just too much of a pain!

It is hard for people to avoid giving others two copies of replies to
on-list messages. In my case I've solved this: my email client
(Gnus) detects duplicate messages (based on message ID) and inserts a
warning field, and I filter messages that have this field into my Spam box.
Problem solved.

-- 
Lars Marius Garshol, Ontopian URL: http://www.ontopia.net 
GSM: +47 98 21 55 50  URL: http://www.garshol.priv.no 





Re: How to convert special characters into unicode?

2003-02-05 Thread Doug Ewell
SRIDHARAN Aravind ASridharan at covansys dot com wrote:

 I have Czech special characters in an excel file.
 I copy them into Notepad.
 I save them.

 Now I use native2ascii convertor that is available with JDK.
 After I run this utility, I am getting some other unicode values or
 sometimes only whitespaces come out.
 I don't know why?

As Chris said, pasting them into Notepad is probably the trouble,
because U+010C and U+010D are not part of Windows code page 1252.  If
you are running Windows 2000 or XP, Notepad can save as Unicode, but you
must explicitly tell it to do so (the default is ANSI).  Better to use
a Unicode-capable editor such as WordPad, Word, or SC UniPad instead.
(Windows code pages 1250 and 1257 do support the two Czech characters.)

Since you already know the Unicode code points, it would probably have
been easier just to type the escape sequences (Universal Character Names)
directly:

\u010c
\u010d

Alternatively, if you use SC UniPad, there is an option to convert
directly to UCN (as Adam mentioned), without having to bother with
native2ascii.
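
For instance, a minimal Java sketch (just an illustration; the class and
variable names are arbitrary) that carries the two characters via those
escapes and prints their code points back out:

// CaronDemo.java - illustrative only.
public class CaronDemo {
    public static void main(String[] args) {
        // LATIN CAPITAL LETTER C WITH CARON and LATIN SMALL LETTER C WITH CARON
        String czech = "\u010C\u010D";
        for (int i = 0; i < czech.length(); i++) {
            System.out.printf("U+%04X%n", (int) czech.charAt(i));
        }
    }
}

And if the text really does have to go through native2ascii, saving it from
WordPad as Unicode and running "native2ascii -encoding UTF-16 input.txt
output.txt" should produce the same \u escapes, assuming the JDK in use
accepts UTF-16 as an encoding name.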

-Doug Ewell
 Fullerton, California





Re: Indic Devanagari Query

2003-02-05 Thread Peter_Constable

On 02/04/2003 02:52:25 PM jameskass wrote:

If these alternate forms were needed to be displayed in a single
multi-lingual plain-text file, wouldn't we need some method of
tagging the runs of Latin text for their specific languages?

The plain-text file would be legible without that -- I don't think this is
an argument in favour of plane 14 tag characters. Preserving
culturally-preferred appearance would certainly require markup of some
form, whether lang IDs or for font-face and perhaps font-feature
formatting.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







discovering code points with embedded nulls

2003-02-05 Thread Erik.Ostermueller
Hello, all.

I'm dealing with an API that claims it doesn't support unicode characters with 
embedded nulls.
I'm trying to figure out how much of a liability this is.

What is my best plan of attack for discovering precisely which code points have 
embedded nulls
given a particular encoding?  Didn't find it in the maillist archive.
I've googled for quite a while with no luck.  

I'll want to do this for a few different versions of unicode and a few different 
encodings.
What if I write a program using some of the data files available at unicode.org?
Am I crazy (I'm new at this stuff) or am I getting warm?
Perhaps this data file: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?

Algorithm:
INPUT: Name of unicode code point file
INPUT: Name of encoding (perhaps UTF-8)

Read code point from file.
Expand code point to encoded format for the given encoding.
Test all constituent bytes for 0x00.
Goto next code point from file.
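
Something like the following Java sketch is what I have in mind (untested,
and the names are made up; it walks the BMP with the standard charset API
instead of reading UnicodeData.txt, since the encoded bytes of a code point
don't depend on that file):

import java.nio.charset.Charset;

// NullByteScan.java - untested sketch. Counts BMP code points whose
// encoded form in the given charset contains a 0x00 byte.
public class NullByteScan {
    public static void main(String[] args) {
        Charset cs = Charset.forName(args.length > 0 ? args[0] : "UTF-8");
        int count = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // skip surrogate code points
            byte[] bytes = String.valueOf((char) cp).getBytes(cs);
            for (byte b : bytes) {
                if (b == 0x00) { count++; break; }
            }
        }
        System.out.println(count + " BMP code points contain a 0x00 byte in " + cs);
    }
}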

Thanks in advance for any help,

--Erik O.






Re: Indic Devanagari Query

2003-02-05 Thread Peter_Constable

On 02/05/2003 04:05:44 AM Andrew C. West wrote:

 If these alternate forms were needed to be displayed in a single
 multi-lingual plain-text file, wouldn't we need some method of
 tagging the runs of Latin text for their specific languages?

Is this not what the variation selectors are available for ?

That is a possible technical solution to such variations, though specific
character+variant combinations would have to be approved and documented by
UTC. It's not the only solution, and might or might not be the best.




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Andrew C. West wrote,

 Is this not what the variation selectors are available for ?

 And now that we are soon to have 256 of them, perhaps Unicode ought not to be shy
 about using them for characters other than mathematical symbols.


Yes, there seem to be additional variation selectors coming in 
Unicode 4.0 as part of the 1207 (is that number right?) new
characters.

(What happens if someone discovers a 257th variant?  Do they
get a prize?  Or, would they be forever banished from polite
society?)

The variation selectors could be a practical and effective method 
of handling different glyph forms.

But, consider the burden of incorporating a large number of
variation selectors into a text file and contrast that with the
use of Plane Fourteen language tags.  With the P14 tags, it's
only necessary to insert two special characters, one at the
beginning of a text run, the other at the ending.

Jim Allan wrote,

 One could start with indications as to whether the text was traditional 
 Chinese, simplified Chinese, Japanese, Korean, etc. :-(
 
 But I don't see that there is anything particularly wrong with citing or 
 using a language in a different typographical tradition.
 ...

Neither do I.  I kind of like seeing variant glyphs in runs of text and
am perfectly happy to accept unusual combinations.

Perhaps those of us who deal closely with multilingual material
and are familiar with variant forms are simply more tolerant
and accepting.

 ... A linguistic 
 study of the distribution of the Eng sound might cite written forms with 
 capital letters from Sami and some from African languages, but need not 
 and probably should not be concerned about matching exactly the exact 
 typographical norms in those tongues, for _eng_ or for any other letter.

On the one hand, there's a feeling that insistence upon variant glyphs
for a particular language is provincial.  On the other hand, everyone
has the right to be provincial (or not).  IMO, it's the ability to
choose that is paramount.

If anyone wishes to distinguish different appearances of an acute
accent between, say, French and Spanish... or the difference of the
ogonek between Polish and Navajo... or the variant forms of
capital eng, then there should be a mechanism in place enabling 
them to do so.

Variation selectors would be an exact method with the V.S. characters
manually inserted where desired.  P14 tags would also work for this;
entire runs of text could be tagged and those runs could be properly
rendered once the technology catches up to the Standard.

Neither V.S. nor P14 tags should interfere with text processing
or break any existing applications.  There are pros and cons for
either approach.

Best regards,

James Kass
.




VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Peter Constable wrote,

 The plain-text file would be legible without that -- I don't think this is
 an argument in favour of plane 14 tag characters. Preserving
 culturally-preferred appearance would certainly require markup of some
 form, whether lang IDs or for font-face and perhaps font-feature
 formatting.

Any Unicode formatting character can be considered as mark-up,
even P14 tags or VSs.

The advantage of using P14 tags (...equals lang-ID mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain text.

Best regards,

James Kass
.




RE: discovering code points with embedded nulls

2003-02-05 Thread Rick Cameron
Are you sure the API doesn't support Unicode _characters_ with embedded
NULs? Or does it fail to support Unicode _strings_ with embedded NULs?

If it really is the former, no character in UTF-8 (except, of course,
U+0000) will include a NUL byte. In UTF-16, it will be any character of the
form U+00xx (that is, all the ASCII and Latin-1 characters) or U+xx00 (a
great miscellany of characters).

It's hard to believe that an API that accepts UTF-16 would not handle ASCII
and Latin-1 characters! So I think the restriction must be about embedded
U+0000 characters in strings.

If so, that's much less onerous - it's pretty weird to embed U+0000 in the
middle of a string, despite the fact that many Win32 API functions require
this!

- rick

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] 
Sent: Wednesday, 5 February 2003 8:43
To: [EMAIL PROTECTED]
Subject: discovering code points with embedded nulls


Hello, all.

I'm dealing with an API that claims it doesn't support unicode characters
with embedded nulls. I'm trying to figure out how much of a liability this
is.

What is my best plan of attack for discovering precisely which code points
have embedded nulls given a particular encoding?  Didn't find it in the
maillist archive. I've googled for quite a while with no luck.  

I'll want to do this for a few different versions of unicode and a few
different encodings. What if I write a program using some of the data files
available at unicode.org? Am I crazy (I'm new at this stuff) or am I getting
warm? Perhaps this data file:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?

Algorithm:
INPUT: Name of unicode code point file
INPUT: Name of encoding (perhaps UTF-8)

Read code point from file.
Expand code point to encoded format for the given encoding. Test all
constituent bytes for 0x00. Goto next code point from file.

Thanks in advance for any help,

--Erik O.






RE: discovering code points with embedded nulls

2003-02-05 Thread Marco Cimarosti
Erik Ostermueller wrote:
 I'm dealing with an API that claims it doesn't support 
 unicode characters with embedded nulls.
 I'm trying to figure out how much of a liability this is.

If by embedded nulls they mean bytes of value zero, that library can
*only* work with UTF-8. The other two UTF's cannot be supported in this way.

But are you sure you understood clearly? Didn't they perhaps write Unicode
*strings* with embedded nulls? In that case, they could have meant null
*characters* inside strings. I.e., they don't support strings containing the
Unicode character U+0000, because that code is used as a string terminator.
In this case, it would be a common and accepted limitation.

 What is my best plan of attack for discovering precisely 
 which code points have embedded nulls
 given a particular encoding?  Didn't find it in the maillist archive.
 I've googled for quite a while with no luck.  

The question doesn't make sense. However:

UTF-8: Only one character is affected (U+0000 itself);

UTF-16: In the range U+0000..U+FFFF (Basic Multilingual Plane), there are of
course exactly 511 code points affected (all those of the form U+00xx or
U+xx00), 484 of which are actually assigned. However, a few of these code
points are high or low surrogates, which means that many characters in the
range U+10000..U+10FFFF are also affected (see the sketch after this list).

UTF-32: All characters are affected, because the high byte of a UTF-32 unit
is always 0x00.
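
To make the surrogate remark above concrete (just an illustration of mine;
the class name is invented): the U+xx00 code points that are surrogates are
exactly the eight UTF-16 code units listed by this sketch, and every
supplementary-plane character whose UTF-16 form uses one of them picks up a
zero byte.

// Lists the UTF-16 code units of the form 0xXX00 that fall in the
// surrogate range. Each of the four high surrogates below begins 1,024
// supplementary characters, and each of the four low surrogates ends
// another 1,024, so many supplementary characters contain a 0x00 byte
// in UTF-16.
public class SurrogateZeroUnits {
    public static void main(String[] args) {
        for (int unit = 0xD800; unit <= 0xDF00; unit += 0x0100) {
            String role = (unit <= 0xDBFF) ? "high" : "low";
            System.out.printf("0x%04X (%s surrogate)%n", unit, role);
        }
    }
}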

 I'll want to do this for a few different versions of unicode 
 and a few different encodings.

Most single- and double-byte encodings behave like UTF-8 (i.e., a
zero byte is only needed to encode U+0000 itself).

 What if I write a program using some of the data files 
 available at unicode.org?
 Am I crazy (I'm new at this stuff) or am I getting warm?
 Perhaps this data file: 
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
 
 Algorithm:
 INPUT: Name of unicode code point file
 INPUT: Name of encoding (perhaps UTF-8)
 
 Read code point from file.
 Expand code point to encoded format for the given encoding.
 Test all constituent bytes for 0x00.
 Goto next code point from file.

That would be totally useless, I am afraid.

The only UTF for which this count makes sense is UTF-8, and the result is
one.

_ Marco




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Asmus Freytag
At 06:24 PM 2/5/03 +0000, [EMAIL PROTECTED] wrote:

The advantage of using P14 tags (...equals lang-ID mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain text.


The minute you have scoped tagging, you are no longer using
plain text.

The P14 tags are no different than HTML markup in that regard;
however, unlike HTML markup, they can be filtered out by a
process that does not implement them. (In order to filter
out HTML, you need to know the HTML syntax rules. In order
to filter out P14 tags, you only need to know their code point
range.)
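
For illustration only (a sketch of mine; the class name is arbitrary and
nothing here is required by the standard), filtering by range can be as
simple as dropping everything in U+E0000..U+E007F:

// Drops Plane 14 tag characters (U+E0000..U+E007F) from a string.
// No knowledge of the tag syntax is needed - only the code point range.
public class StripPlane14Tags {
    public static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0xE0000 || cp > 0xE007F) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}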

Variation selectors also can be ignored based on their code
point values, but unlike P14 tags, they don't become invalid
when text is cut and pasted from the middle of a string.

If 'unaware' applications treat them like unknown combining
marks and keep them with the base character, as they would
any other combining mark during editing, then variation
selectors have a good chance of surviving in plain text.

P14 tags do not.

Unicode 4.0 will be quite specific: "P14 tags are reserved for
use with particular protocols requiring their use" is what the
text will say, more or less.

A./






Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Peter_Constable

On 02/05/2003 12:24:39 PM jameskass wrote:

The advantage of using P14 tags (...equals lang-ID mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain text.

Sure, but why do we want to place so much demand on plain text when the
vast majority of content we interchange is in some form of marked-up or
rich text? Let's let plain text be that -- plain -- and look to the markup
conventions that we've invested so much in and that are working for us to
provide the kinds of thing that we designed markup for in the first place.
Besides, a plain-text file that begins and ends with p14 tags is a
marked-up file, whether someone calls it plain text or not. We have
little or no infrastructure for handling that form of markup, and a large
and increasing amount of infrastructure for handling the more typical forms
of markup.

I repeat, plain text remains legible without anything indicating which eng
(or whatever) may be preferred by the author, and (since the requirement
for plain text is legibility) therefore this is not really an argument for
using p14 language tags. IMO.




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485











RE: discovering code points with embedded nulls

2003-02-05 Thread Kenneth Whistler
Erik followed up:

 From what I'm hearing from you all is that a null 
 in UTF-8 is for termination and termination only.
 Is this correct?

Not quite. A null byte (0x00) in UTF-8 is only a
representation of the NULL character (U+0000). It can
be present in UTF-8 for whatever purposes one might use
a NULL in textual data.

One very common usage of a NULL is as a convention for
string termination. And if you are using NULL's that way,
then of course any API which depends on that convention
will have a problem with NULL characters embedded *in*
the string for other reasons, since they will prematurely
detect end-of-string in their processing.

If your string termination convention does *not* use
NULL (but instead some other mechanism such as explicit
length attributes), then there is no inherent reason why
you could not use NULL's for some other purpose embedded
in the string -- for example to delimit fielded data
within the string, or some other purpose. In such cases,
if your Unicode data is represented in the UTF-8 encoding
form, then those NULL's will end up as 0x00 embedded
bytes, because that is how NULL characters are represented
in UTF-8.
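
As a tiny illustration (a sketch of mine, nothing more; the names are
arbitrary): the same UTF-8 bytes carry an embedded NULL safely when the
length is explicit, but a C-style, NUL-terminated reading stops at the
first 0x00 byte.

import java.nio.charset.StandardCharsets;

// EmbeddedNullDemo.java - sketch only. Encodes two fields delimited by
// U+0000 (the octal escape \0 below) and compares an explicit-length view
// with a NUL-terminated view of the same bytes.
public class EmbeddedNullDemo {
    public static void main(String[] args) {
        String fielded = "abc\0def";
        byte[] utf8 = fielded.getBytes(StandardCharsets.UTF_8);

        System.out.println("Explicit length: " + utf8.length + " bytes"); // 7

        int cLen = 0;
        while (cLen < utf8.length && utf8[cLen] != 0) {
            cLen++;
        }
        System.out.println("NUL-terminated view: " + cLen + " bytes");    // 3
    }
}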

--Ken





Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Michael Everson
At 16:47 -0500 2003-02-05, Jim Allan wrote:


There are often conflicting orthographic usages within a language. 
Language tagging alone does not indicate whether German text is to 
be rendered in Roman or Fraktur, whether Gaelic text is to be 
rendered in Roman or Uncial, and if Uncial, a modern Uncial or more 
traditional Uncial, whether English text is in Roman or Morse Code 
or Braille.

We have script codes (very nearly a published standard) for that.

By the way, "modern uncial" and "more traditional uncial" isn't 
really sufficient, I think, for describing Gaelic letterforms. See 
http://www.evertype.com/celtscript/fonthist.html for a sketch of a 
more robust taxonomy.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Asmus Freytag wrote,

 Variation selectors also can be ignored based on their code
 point values, but unlike P14 tags, they don't become invalid
 when text is cut and pasted from the middle of a string.

Excellent point.

 Unicode 4.0 will be quite specific: "P14 tags are reserved for
 use with particular protocols requiring their use" is what the
 text will say, more or less.

This seems to be an eminently practical solution to the P14
situation.

If I were using an application that followed a protocol requiring
P14 tags to read a file which included them, and I wanted to cut
and paste text into another application, then in a perfect world the
application would be savvy enough to recognize any applicable P14
tags for the selected text and insert the proper variation selectors
into the text stream to be pasted.

The application which received the pasted text, if it was an application
which used a protocol requiring P14 tags, would be savvy enough to
strip the variation selectors and enclose the pasted string in
the appropriate P14 tags.  If the pasted material was being inserted
into a run of text in which the same P14 tag applied, then the tags
wouldn't be inserted.  If the pasted material was being inserted
into a run of text in which a different P14 tag applied, then the
application would insert begin and end P14 tags as needed.

In a perfect world, in the best of both worlds, both P14 tags and
variation selectors could be used for this purpose.

Is it likely to happen?  Perhaps not.

But, by not formally deprecating P14 tags and using (more or less)
the language you mentioned, the possibilities remain open-ended.

Best regards,

James Kass
.




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Peter Constable wrote,

 Sure, but why do we want to place so much demand on plain text when the
 vast majority of content we interchange is in some form of marked-up or
 rich text? Let's let plain text be that -- plain -- and look to the markup
 conventions that we've invested so much in and that are working for us to
 provide the kinds of thing that we designed markup for in the first place.
 Besides, a plain-text file that begins and ends with p14 tags is a
 marked-up file, whether someone calls it plain text or not. We have
 little or no infrastructure for handling that form of markup, and a large
 and increasing amount of infrastructure for handling the more typical forms
 of markup.

We place so much demand on plain text because we use plain text.

We continue to advance from the days when “plain text” meant ASCII only
rendered in bitmapped monospaced monochrome.

We don’t rely on mark-up or higher protocols to distinguish between different
European styles of quotation marks.  We no longer need proprietary rich-text
formats and font switching abilities to be able to display Greek and Latin
text from the same file.

 I repeat, plain text remains legible without anything indicating which eng
 (or whatever) may be preferred by the author, and (since the requirement
 for plain text is legibility) therefore this is not really an argument for
 using p14 language tags. IMO.

Is legibility the only requirement of plain text?  Might additional
requirements include appropriate, correct encoding and correct display?

To illustrate a legible plain text run which displays as intended (all things
being equal) yet is not appropriately encoded (this e-mail is being sent as
plain text UTF-8):

푰풇 풚풐풖 풄풂풏 풓풆풂풅 풕풉풊풔 
풎풆풔풔풂품풆...
풚풐풖 풎풂풚 풘풊풔풉 풕풐 풋풐풊풏 푴푨푨푨* 
풂풕
퓫퓵퓪퓱퓫퓵퓪퓱퓫퓵퓪퓱퓭퓸퓽퓬퓸퓶

(*헠햺헍헁 헔헅헉헁햺햻햾헍헌 헔햻헎헌햾헋헌 
헔헇허헇헒헆허헎헌)

Clearly, correct and appropriate encoding (as well as legibility) should be a 
requirement of plain text.  Is correct display also a valid requirement for 
plain text?

It is for some...

Respectfully,

James Kass
.




Re: list etiquette (was Re: Tailoring of normalization

2003-02-05 Thread Tex Texin
I don't know about others, but my filters place messages in different folders,
EXCEPT when my name is on the cc or to list.
In that case, the message is left in my inbox for more immediate review and
possible response.

The Unicode lists are also slow to send mail, so there can be a significant
delay (an hour or more at times). Having the name on the mail can speed
responses, although it's perhaps unjust to people not named on the to-list.

There probably isn't a one-size-fits-all solution, short of those not wanting
a response changing their reply-to address to [EMAIL PROTECTED].

tex


[EMAIL PROTECTED] wrote:
 
 On 02/04/2003 10:58:54 AM Rick McGowan wrote:
 
 (Number one: Please don't CC me on this discusion. I'm on the list and I
 don't need 2 copies of every mail.)
 
 I wish everyone made a habit of not cc'ing those already on the list -- I
 likewise don't like receiving personally-addressed copies of messages sent
 to the list.


-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




Re: How to convert special characters into unicode?

2003-02-05 Thread Chris Jacobs




You mean like this?
The following is the zodiac [ U+2648 ... U+2653 ], two times over.
Mortbats Zodiac:
1234567890-=
[ Needs the Mortbats font to display, http://www.dingbatpages.com ]
Unicode Zodiac:
♈♉♊♋♌♍♎♏♐♑♒♓
[ Needs e.g. Arial Unicode MS to display ]

The upper of these two zodiacs will give the wrong Unicode numbers; the lower
will give the right numbers.

- Original Message - 
From: "SRIDHARAN Aravind" 
[EMAIL PROTECTED]
To: "Chris Jacobs" [EMAIL PROTECTED]
Sent: Thursday, February 06, 2003 4:57 
AM
Subject: RE: How to convert special 
characters into unicode?
Yes, I have Czech language by default in my 
windows 2000.But the thing is that when I convert it into unicode, it goes 
corrupt - in the sense- the unicode value gets corrupt.I don't know 
why?I believe Notepad just fools around me.It just make me believe my 
eyes that the value is the desired special character.Thank 
you.Aravind