Re: [OT] reciting program code (was: [idn] nameprep forbidden

2001-02-08 Thread Asmus Freytag

No fair - please back up this claim with some evidence! ;-)

At 02:16 AM 9/20/00 -0800, Otto Stolz wrote:
>Not exactly chanting, I once have written a Pascal program in verse
>and rhyme. A colleague had asked for a missing subroutine in verse,
>so I supplied one in the same style.




Re: [OT] reciting program code (was: [idn] nameprep forbidden

2001-02-08 Thread Otto Stolz

At 02:16 AM 9/20/00 -0800, Otto Stolz wrote:
> Not exactly chanting, I once have written a Pascal program in verse
> and rhyme. A colleague had asked for a missing subroutine in verse,
> so I supplied one in the same style.

(I had better have written "once upon a time"; cf. infra.)

On 2001-02-08 at 08:32 UTC, Asmus Freytag wrote:
> No fair - please back up this claim with some evidence! ;-)

I would love to do so, actually I would have included the
example in my original note -- if I only could. Alas, this
masterpiece of contemporary poetry has been lost together
with the computer it was designed for, years ago. Also,
the colleague mentioned has moved to another employer, so
he is no longer available as a possible source.

Best wishes,
  Otto Stolz



Unicode collation algorithm - interpretation

2001-02-08 Thread J M Sykes

In the proposal for better accommodating UCS in SQL, we assumed that a
comparison performed according to UTR#10, "Unicode Technical Standard #10
Unicode Collation Algorithm", would require four parameters, viz.

Two strings to be compared

A collation element table

A maximum level as mentioned in UTR#10, section 4.3
"Form a sort key for each string", which specifies Step 3.

SQL already uses the term 'collation', each of which is identified by a
, but does not accommodate the notion that the same
collation element table can be applied at different levels.

In our proposal, we have assumed that  identifies a
collation element table, and have extended SQL syntax to allow the user to
specify the fourth parameter (or leave it to be defaulted).

It has been suggested that SQL  should instead identify both
collation element table and maximum level.

Perhaps the second approach might be useful in the case where, for reasons
of performance, sort keys are constructed in advance of being needed, for
example to be stored as 'shadow columns' in SQL base tables, or in indexes.

On the other hand, the first approach seems to be more user-friendly in the
case where at least two collation element tables are available, provided
their levels correspond (i.e. provided level 2 means 'case-blind' in both
cases).
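
To make the four parameters concrete, here is a minimal Python sketch. The tiny table is a made-up stand-in for a real collation element table (its weights are invented for illustration and are not UTR#10 data), and `sort_key` and `compare` are hypothetical names. It builds a key level by level up to a maximum level, roughly in the spirit of "Form a sort key for each string", and omits everything else the algorithm does.

```python
# Toy collation element table: character -> (primary, secondary, tertiary).
# The weights are invented for illustration; they are not UTR#10 data.
TOY_TABLE = {
    "a": (1, 1, 1), "A": (1, 1, 2),
    "á": (1, 2, 1), "Á": (1, 2, 2),
    "b": (2, 1, 1), "B": (2, 1, 2),
}

def sort_key(s, table, max_level):
    """Concatenate weights level by level, stopping at max_level (1..3)."""
    elements = [table[ch] for ch in s]
    key = []
    for level in range(max_level):
        key.extend(e[level] for e in elements)
        key.append(0)  # level separator, lower than any real weight
    return tuple(key)

def compare(s1, s2, table, max_level):
    """Three-way comparison: returns -1, 0 or 1."""
    k1 = sort_key(s1, table, max_level)
    k2 = sort_key(s2, table, max_level)
    return (k1 > k2) - (k1 < k2)

print(compare("ab", "AB", TOY_TABLE, 2))  # case is ignored at level 2
print(compare("ab", "AB", TOY_TABLE, 3))  # case counts at level 3
```

The same table serves both calls and only the fourth parameter changes, which is the situation the first approach is meant to accommodate. In this toy table level 2 happens to be case-blind.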

Would anyone care to comment?

Mike.

***

J M Sykes  Email: [EMAIL PROTECTED]
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire   SK8 3SN
UK         Tel: (44) 161 437 5413

***





Article in Financial Times; Feb 7, 2001

2001-02-08 Thread J M Sykes

I got a reference to the following from ACM TechNews - Wednesday, February
7, 2001, so some may have seen it already.

http://globalarchive.ft.com/globalarchive/article.html?id=010207001454

It shows a degree of ignorance that I would hardly have believed possible in
a reputable newspaper. I know there are vocal members of this list who are
more knowledgeable on the subject than I, and I invite them to offer
corrections.

The email address for letters to the FT is [EMAIL PROTECTED]

If I don't hear that someone has accepted this invitation, I shall do my own
humble best.

As a taster, I append a few quotes:



Until recently, even the accents commonly used above and below letters in
German, French and Spanish could cause problems. This was because the
original system for representing letters on computers - known as Ascii
coding - set in stone only half of the 256 codes available to identify
different characters. (The basic unit of computing data used to do this is
the byte, each of which has eight bits, which can be either 0 or 1. That
gives 256 alternatives.)

This system covered the standard upper and lower case alphabet, numbers,
punctuation and a few special symbols found on keyboards, such as currency
and percentage symbols.

The other 128 codes could be used arbitrarily. Printers used them to create
special effects. Software applications used them, among other things, to
represent accented characters. But different applications adopted different
standards. This is why you still occasionally see gobbledegook in e-mail
messages from other countries.

The International Standards Organisation (ISO) has now agreed to give
standard meanings to these remaining codes. The new standard is known as
'Latin-1' or 'extended Ascii' and includes accented characters.


Note the "now", and "new".

Another coding system has been devised to cater for Asian languages such as
Chinese, Japanese and Korean. These languages have thousands of ideographic
characters, each representing a single word. A coding system called Unicode
has emerged as the standard. This uses twice as much data to represent each
character, and so is known as a 'double-byte' coding system.


Note the "Another"


There will be improvements as the Unicode standard becomes adopted in the
next version of HTML - the computer language that underpins information
display on the web.


?


More information is available via the website: www.worldnames.com

Copyright: The Financial Times Limited


Even allowing that journalists can't be expected to be experts on
everything, and some standards take a long time to be widely adopted (and
some never are!), the extracts above seem to me to give a rather distorted
picture.

Mike.






Re: Article in Financial Times; Feb 7, 2001

2001-02-08 Thread Michael Everson

At 04:48 -0800 2001-02-08, J M Sykes quoted the FT:

>The International Standards Organisation (ISO) has now agreed to give
>standard meanings to these remaining codes.

Which as everyone knows, is really the International Organization for
Standardization (ISO).

Sigh.
--
Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire



RE: Article in Financial Times; Feb 7, 2001

2001-02-08 Thread Marco Cimarosti

Mike Sykes wrote:
> http://globalarchive.ft.com/globalarchive/article.html?id=010207001454
> It shows a degree of ignorance that I would hardly have believed
> possible in a reputable newspaper.

"Technical" and "scientific" articles in most "reputable" newspapers are
often of that quality.

What worries me more is that I only notice when they occasionally talk about
things I know. I wonder what the hell they are telling me in fields that I
know nothing about.

The following are my favorite selections:


[Unicode] would be an inefficient means of representing English and European
languages that use the Roman alphabet because e-mails and text files would
be twice as big in terms of data. [...]


He clearly knows nothing about UTF-8. But don't tell him! Or he could also
find out about UTF-32, and tell the F.T. readers that because of Unicode
e-mails will soon be full of 4 letter words!
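
For anyone who wants to check the sizes, a small Python sketch follows. The helper name `encoded_sizes` is hypothetical, and the sample word is borrowed from the article's own example.

```python
def encoded_sizes(text):
    """Return the byte length of text in several encodings."""
    # The -le variants avoid the byte order mark that plain
    # "utf-16" and "utf-32" would prepend.
    encodings = ("ascii", "utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for encoding, size in encoded_sizes("wind").items():
    print(encoding, size)
```

For ASCII-range text the UTF-8 size equals the ASCII size; only the UTF-16 and UTF-32 forms are two and four times larger.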


Since each word has a unique code, there is also less of the ambiguity that
is inherent in, for example, English, where a word such as 'wind' can be a
noun meaning 'movement of air' or a verb meaning to 'crank a handle'.
Unicode's unambiguous meanings improve the accuracy of automated translation
systems, especially when attempting translations from these languages. [...]


This goes in "The Benefits of Unicode": it improves the accuracy of
automated translation systems!


Since homophones exist, there may be more than one character for a
pronunciation - much like "colonel", the military rank, and "kernel", the
nut in a fruit, in English. [...]


But, hey!, then it also goes in "The Benefits of ASCII(tm)": it too improves
the accuracy of automated translation systems by encoding "colonel" and
"kernel" with unambiguous sequences!

_ Marco



Re: Article in Financial Times; Feb 7, 2001

2001-02-08 Thread James E. Agenbroad

On Thu, 8 Feb 2001, Michael Everson wrote:

> At 04:48 -0800 2001-02-08, J M Sykes quoted the FT:
> 
> >The International Standards Organisation (ISO) has now agreed to give
> >standard meanings to these remaining codes.
> 
> Which as everyone knows, is really the International Organization for
> Standardization (ISO).
> 
> Sigh.
 Thursday, February 8, 2001
And the next sentence: "The new standard is known as 'Latin-1' or
'extended ASCII' and includes accented characters."  I'd say 'includes
*some* accented characters' just as  Latin-2, Latin-3 etc. include other
repertoires of accented characters and other alphabets needed for a
particular group.  And later: "Double-byte codes are a very efficient
means of storing ideographic characters, such as Chinese, since a whole
word is stored in the equivalent of the space for two letters. Since each
word has a unique code there is also less of the ambiguity that is
inherent in, for example English ... Unicode's unambiguous meanings
..." This begs for a definition of a Chinese word and seems unaware that
Unicode assigns codes to characters, not to words or their meanings. Later
he accurately enough describes one approach to the character input issue
and then leaps to the domain names issue.  I was unable to find
www.worldnames.com which he cites. I join Michael with a sigh.  Feel free
to use these thoughts as part of a response, but please do not forward this to
him or the Financial Times.
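
Both points can be illustrated with a short Python sketch. The helper `encodable` is a hypothetical name, and the sample characters are chosen only for illustration.

```python
def encodable(ch, encoding):
    """True if the character exists in the given character set."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# e-acute is in both Latin-1 and Latin-2; o with double acute is only in Latin-2.
for ch in ("\u00e9", "\u0151"):
    print(ch, encodable(ch, "iso-8859-1"), encodable(ch, "iso-8859-2"))

# A two-character Chinese word is two code points, one per character.
word = "\u4e2d\u6587"
print([f"U+{ord(c):04X}" for c in word])
```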
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  




Re: Article in Financial Times; Feb 7, 2001

2001-02-08 Thread P. T. Rourke

Perhaps he meant http://www.worldnames.net/ ?

> I was unable to find www.worldnames.com which he cites. 





The normalization form of the result of a dyadic operation.

2001-02-08 Thread J M Sykes

When a standard-conforming SQL implementation concatenates two normalized
UCS strings, then it is required that the result be normalized (noting
Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).

My question is, supposing the NF of the two operands to be different, what
should be the NF of the result?

In its present state, our proposal specifies the result by referring to the
following table:

Table A
=======
            | Operand 2
 Operand 1  | NFKD  NFKC  NFD   NFC
 -----------+-----------------------
 NFKD       | NFKD  NFKC  NFD   NFC
 NFKC       | NFKC  NFKC  NFD   NFC
 NFD        | NFD   NFD   NFD   NFC
 NFC        | NFC   NFC   NFC   NFC

It has been suggested that the following would be preferable:


Table B
=======
            | Operand 2
 Operand 1  | NFKD  NFKC  NFD   NFC
 -----------+-----------------------
 NFKD       | NFKD  NFKC  NFKD  NFKC
 NFKC       | NFKC  NFKC  NFKD  NFKC
 NFD        | NFKD  NFKD  NFD   NFC
 NFC        | NFKC  NFKC  NFC   NFC

I have no confident opinion on this, and don't believe I could form one
without more practical experience than I'm ever likely to have. My very
tentative opinion, for what it's worth, is based on a preference for NFC
over NFKC.

Any offers?

Mike.








Re: Article in Financial Times; Feb 7, 2001

2001-02-08 Thread J M Sykes

- Original Message -
From: "Michael Everson" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, February 08, 2001 1:49 PM
Subject: Re: Article in Financial Times; Feb 7, 2001


> At 04:48 -0800 2001-02-08, J M Sykes quoted the FT:
>
> >The International Standards Organisation (ISO) has now agreed to give
> >standard meanings to these remaining codes.
>
> Which as everyone knows, is really the International Organization for
> Standardization (ISO).

Quite, but I've become so accustomed to these two errors over the years that
these days they only make me wince a little.

I believe 'ISO' was a compromise whose acceptability was reinforced by the
dictionary meaning:

iso- comb. form.
1 Used in wds adopted f. Gk and in Eng. wds modelled on these, and as a
freely productive pref., mainly in scientific and technical use, w. the
sense `equal'.

Which must be one of the most cryptic dictionary entries I've seen in a
month or two!

Mike.




Re: Unicode collation algorithm - Khmer/Cambodian

2001-02-08 Thread J M Sykes

I'm afraid you have the wrong bloke here, Maurice. The technicality of my
query may have fooled you into thinking I'm a UTR#10 expert - far from it!

All I can do is cc your query to the Unicode list - and wish you luck,
naturally :-)

Mike.

- Original Message -
From: "Maurice Bauhahn" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, February 08, 2001 2:27 PM
Subject: Unicode collation algorithm - interpretation


> Hello Mike, from the U.K.!
>
> What I have seen of the Unicode collation algorithm makes me wonder whether
> it will handle syllabic-based ordering! I specialise in
> which has (at least) six levels of priority within each syllable. Hopefully
> SQL collation will be open to such difficult environments.
>
> http://www.bauhahnm.clara.net/KhmerSortingUnicodebeta.pdf
>
> Cheers,
>
> Maurice Bauhahn
> Brio Technology Europe (Technical Support)
> email: 
>
> Office Telephone: +44 (0) 1932 878404
> Home Tel: +44(0)1628 626068
>
>
>




FW: conversion of e-mail addresses

2001-02-08 Thread Magda Danish (Unicode)



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 08, 2001 8:26 AM
To: [EMAIL PROTECTED]
Subject: conversion of e-mail addresses


I have floppy disk with addresses from another computer's e-mail (I was
using the Microsoft e-mail system at the time).

I want to drop the e-mail addresses I have in a WAB file into my current
Lotus Notes e-mail program, but I've been told to convert it to unicode
first.

I looked on your site for a conversion program, but can't find any.  Where
can I look?

Rob Tonus
Campaign Manager
Waterloo Millennium Recreation Park
(519) 885-6980

see the project at http://www.city.waterloo.on.ca/mrp







UTF-8 support in Mac & AOL browsers

2001-02-08 Thread Glen Perkins

(Assume that whatever script you want to display is displayable if you were
to use a legacy encoding. I.e., assume that if you want to send Japanese
text in UTF-8, that the Mac is either a Japanese Mac or is using the JDK, so
displaying shift_JIS pages would work. I'm trying to determine what
percentage of users out there wouldn't be able to use Web forms if the
encoding were UTF-8, assuming I know the percentage of people using various
browsers.)

1) I have a pretty good idea of the UTF-8 handling issues of Windows
versions of IE and NN, but I don't know what happens on the Mac with these
two makes of browsers. I assume that version 3 or earlier won't work, except
for ASCII-only text. What I don't know is the extent to which versions 4 and
later will support UTF-8. Again, given the assumption I stated at the top,
what aspects of UTF-8 support will fail with which browsers/OS versions?
(Perhaps it will display, but form data isn't returned in UTF-8, or maybe
you have to have both the right browser version AND Mac OS 8 or later, or
whatever.)

2) Are the UTF-8 issues in the AOL versions of IE the same as in ordinary
IE?

Thanks,

Glen Perkins




RE: conversion of e-mail addresses

2001-02-08 Thread Marco Cimarosti

Rob Tonus wrote:
> I have floppy disk with addresses from another computer's 
> e-mail (I was
> using the Microsoft e-mail system at the time).
> 
> I want to drop the e-mail addresses I have in a WAB file into 
> my current
> Lotus Notes e-mail program, but I've been told to convert it 
> to unicode
> first.
> 

Someone pointed me to this just a few days ago:
http://freshmeat.net/projects/asrecod/
Unluckily, the instructions are in Russian.

However, I am not sure that this is what you need. This and other available
tools convert TEXT FILES from one character set to another. It is very
probable that your address file is NOT a text file, but rather some sort
of BINARY format.
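
For reference, what such a text converter does can be sketched in a few lines of Python. The function name `convert_text_file`, the file names and the cp1252 source encoding are all assumptions made for illustration; none of this applies to a binary address-book file.

```python
from pathlib import Path

def convert_text_file(src, dst, src_encoding, dst_encoding="utf-8"):
    """Re-encode a plain text file from one character set to another."""
    # Decode the bytes using the source character set ...
    text = Path(src).read_text(encoding=src_encoding)
    # ... and write the same characters back out in the target encoding.
    Path(dst).write_text(text, encoding=dst_encoding)

# Hypothetical usage; the file names and source encoding are assumptions:
# convert_text_file("addresses.txt", "addresses-utf8.txt", "cp1252")
```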

So what you need to know is whether and how YOUR version of Lotus Notes is
able to import a contacts file from THAT e-mail system, and I bet that
Unicode has little or nothing to do with this.

Hoping this helps.
Marco Cimarosti




Re: conversion of e-mail addresses

2001-02-08 Thread David Starner

On Thu, Feb 08, 2001 at 10:26:02AM -0800, Marco Cimarosti wrote:
> Someone pointed me to this just a few days ago:
>   http://freshmeat.net/projects/asrecod/
> Unluckily, the instructions are in Russian.

Why would you point to this converter? Assuming this is for Unix only,
(which is true as far as I can tell), iconv, GNU recode and konwert
are all better programs for most uses. In any case, if he needed a
text converter, it sounds like he's running Windows, so a source archive
is probably not the best solution for him.

--  
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg



Re: The normalization form of the result of a dyadic operation.

2001-02-08 Thread Peter_Constable


On 02/08/2001 11:20:27 AM "J M Sykes" wrote:

>When a standard-conforming SQL implementation concatenates two normalized
>UCS strings, then it is required that the result be normalized (noting
>Unicode Standard Annex #15 Unicode Normalization Forms, Concatenation).

Yes. It must be understood that a concatenated string is not guaranteed to
be normalised until it is explicitly normalised, regardless of the state of
the operand strings.
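
A minimal Python sketch of this point, using the standard `unicodedata` module. The helper `is_nfc` is a hypothetical name, and the operands are chosen only to show the effect.

```python
import unicodedata

def is_nfc(s):
    """True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s

left = "e"        # U+0065 LATIN SMALL LETTER E
right = "\u0301"  # U+0301 COMBINING ACUTE ACCENT on its own

joined = left + right
print(is_nfc(left), is_nfc(right))  # each operand is already in NFC
print(is_nfc(joined))               # the plain concatenation is not
print(unicodedata.normalize("NFC", joined) == "\u00e9")
```

The two operands compose across the join, so the result has to be normalized again after concatenation.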



>My question is, supposing the NF of the two operands to be different, what
>should be the NF of the result?
>
>In its present state, our proposal specifies the result by referring to the
>following table:
>
>Table A
>=======
>            | Operand 2
> Operand 1  | NFKD  NFKC  NFD   NFC
> -----------+-----------------------
> NFKD       | NFKD  NFKC  NFD   NFC
> NFKC       | NFKC  NFKC  NFD   NFC
> NFD        | NFD   NFD   NFD   NFC
> NFC        | NFC   NFC   NFC   NFC
>
>It has been suggested that the following would be preferable:
>
>
>Table B
>=======
>            | Operand 2
> Operand 1  | NFKD  NFKC  NFD   NFC
> -----------+-----------------------
> NFKD       | NFKD  NFKC  NFKD  NFKC
> NFKC       | NFKC  NFKC  NFKD  NFKC
> NFD        | NFKD  NFKD  NFD   NFC
> NFC        | NFKC  NFKC  NFC   NFC




I'm trying to make sense of these tables. Apparently, Table A consistently
applies a precedence of NFC > NFD > NFKC > NFKD. (I.e. the form for the
result should be the same as that of the operand with the highest form
according to this ordering.) Apparently, Table B gives a precedence to K
forms (K > ~K), and a precedence to C over D (C > D), but the first
ordering (K > ~K) is given higher priority over the second ordering (C >
D).
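
The reading of Table A as a single precedence ordering can be checked mechanically. The sketch below transcribes Table A and compares every cell with "take the operand that ranks higher"; the names `FORMS`, `TABLE_A` and `highest` are hypothetical.

```python
# Forms listed from lowest to highest precedence: NFC > NFD > NFKC > NFKD.
FORMS = ["NFKD", "NFKC", "NFD", "NFC"]

# Table A as posted: row = operand 1, columns = operand 2 in FORMS order.
TABLE_A = {
    "NFKD": ["NFKD", "NFKC", "NFD", "NFC"],
    "NFKC": ["NFKC", "NFKC", "NFD", "NFC"],
    "NFD":  ["NFD",  "NFD",  "NFD", "NFC"],
    "NFC":  ["NFC",  "NFC",  "NFC", "NFC"],
}

def highest(a, b):
    """Return whichever form ranks higher in the precedence ordering."""
    return max(a, b, key=FORMS.index)

matches = all(
    TABLE_A[a][j] == highest(a, b)
    for a in FORMS
    for j, b in enumerate(FORMS)
)
print(matches)
```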

Actually, I don't think I'd go for either. Certainly, table B has a
concern: precedence given to the compatibility decompositions that occur in
NFKD and NFKC -- this results in removing distinctions that, in certain
situations, might be important. Table B should only be used with caution.

Both tables have an anomalous characteristic: if one operand is NFC, then
the result is always to be composed, but if one operand is NFKC and the
other is decomposed, then the result goes in two directions depending upon
the K or ~K property of the other operand. Why? That seems rather strange
to me. If the "Kompatibility" issue is orthogonal to the (de)composition
issue (which these tables follow, and which I think makes sense), then I
would think either C should always take precedence over D, or vice versa.
If we extract a portion from each table (and simplify because the operation
is commutative), we find

Sub-table A
===========
            | Operand 2
 Operand 1  | NFKD  NFD
 -----------+-----------
 NFKC       | NFKC  NFD

Sub-table B
===========
            | Operand 2
 Operand 1  | NFKD  NFD
 -----------+-----------
 NFKC       | NFKC  NFKD


Tables A and B could have just as readily had

Sub-table A.a
=============
            | Operand 2
 Operand 1  | NFKD  NFD
 -----------+-----------
 NFKC       | NFKD  NFC

Sub-table B.a
=============
            | Operand 2
 Operand 1  | NFKD  NFD
 -----------+-----------
 NFKC       | NFKD  NFKC

and I think that wouldn't have been any more or less motivated. It still
wouldn't make sense to me, though: I would have expected D to always have
precedence over C, as in Tables A.b and B.b:

Table A.b
=========
            | Operand 2
 Operand 1  | NFKD  NFKC  NFD   NFC
 -----------+-----------------------
 NFKD       | NFKD  NFKD  NFD   NFD
 NFKC       | NFKD  NFKC  NFD   NFC
 NFD        | NFD   NFD   NFD   NFD
 NFC        | NFD   NFC   NFD   NFC

Table B.b
=========
            | Operand 2
 Operand 1  | NFKD  NFKC  NFD   NFC
 -----------+-----------------------
 NFKD       | NFKD  NFKD  NFKD  NFKD
 NFKC       | NFKD  NFKC  NFKD  NFKC
 NFD        | NFKD  NFKD  NFD   NFD
 NFC        | NFKD  NFKC  NFD   NFC

or for C to always take precedence over D, as in Tables A.c and B.c:

Table A.c
=========
            | Operand 2
 Operand 1  | NFKD  NFKC  NFD   NFC
 -----------+-----------------------
 NFKD       | NFKD  NFKC  NFD   NFC
 NFKC       | NFKC  NFKC  NFC   NFC
 NFD        | NFD   NFC   NFD   NFC
 NFC        | NFC   NFC   NFC   NFC

Table B.c
=========
            | Operand 2
 Operand 1  | NFKD  NFKC  NFD   NFC
 -----------+-----------------------
 NFKD       | NFKD  NFKC  NFKD  NFKC
 NFKC       | NFKC  NFKC  NFKC  NFKC
 NFD        | NFKD  NFKC  NFD   NFC
 NFC        | NFKC  NFKC  NFC   NFC


(What a lot of alternatives!)
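
The four consistent alternatives can all be generated from two independent yes/no choices, which is one way to see the orthogonality argument. The following Python sketch does that; `combine`, `table`, `k_wins` and `c_wins` are hypothetical names, not anything from the SQL proposal.

```python
FORMS = ["NFKD", "NFKC", "NFD", "NFC"]

def combine(a, b, k_wins, c_wins):
    """Pick the result form from two independent yes/no properties."""
    a_k, b_k = "K" in a, "K" in b
    a_c, b_c = a.endswith("C"), b.endswith("C")
    # If a property "wins", the result has it when either operand has it;
    # otherwise the result has it only when both operands have it.
    k = (a_k or b_k) if k_wins else (a_k and b_k)
    c = (a_c or b_c) if c_wins else (a_c and b_c)
    return "NF" + ("K" if k else "") + ("C" if c else "D")

def table(k_wins, c_wins):
    """Build the full 4x4 table: row = operand 1, columns in FORMS order."""
    return {a: [combine(a, b, k_wins, c_wins) for b in FORMS] for a in FORMS}

variants = {
    "A.b": table(k_wins=False, c_wins=False),
    "B.b": table(k_wins=True, c_wins=False),
    "A.c": table(k_wins=False, c_wins=True),
    "B.c": table(k_wins=True, c_wins=True),
}
for name, rows in variants.items():
    print("Table", name)
    for a in FORMS:
        print(f"  {a:<5}| " + " ".join(f"{r:<5}" for r in rows[a]))
```

The original Tables A and B cannot be produced this way, because in both of them the NFKC row is composed against NFKD but decomposed against NFD.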

For the reason described above, I think compatibility decomposition should
be avo

Special casing clarification

2001-02-08 Thread Carl W. Brown

Some of the special casing rules are not clear.

#   FINAL:  The letter is not followed by a letter of category L* (e.g. Ll,
Lt, Lu, Lm, or Lo).

What happens if the word with the final sigma is followed by a period,
comma, etc.?  It should be final.  But what about a hyphenated word?
Technically it is still followed by a letter.  The text needs clarification.

It seems that final should be when it is followed by a space before a letter
or followed by no more letters.
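
The ambiguity can be made concrete with a Python sketch of two possible readings of the FINAL condition. Both function names are hypothetical, and neither is claimed to be what the special casing rules intend; the second reading only approximates the suggestion above.

```python
import unicodedata

SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA

def is_letter(ch):
    """True for general categories Ll, Lt, Lu, Lm and Lo."""
    return unicodedata.category(ch).startswith("L")

def final_if_next_char_not_letter(text, i):
    """Reading 1: only the character immediately after position i matters."""
    return i + 1 >= len(text) or not is_letter(text[i + 1])

def final_if_no_letter_before_space(text, i):
    """Reading 2: scan ahead to the next whitespace (or the end) for a letter."""
    for ch in text[i + 1:]:
        if ch.isspace():
            break
        if is_letter(ch):
            return False
    return True

# A word followed by a period, and a hyphenated word.
samples = ("\u03bf\u03b4\u03bf\u03c3.",
           "\u03bf\u03b4\u03bf\u03c3-\u03bf\u03b4\u03bf\u03c3")
for sample in samples:
    i = sample.index(SIGMA)
    print(sample,
          final_if_next_char_not_letter(sample, i),
          final_if_no_letter_before_space(sample, i))
```

The two readings agree before a period and disagree on the sigma before the hyphen, which is exactly the case the text leaves open.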

Another issue - case folding.  Case folding appears to be the same as a
shift to upper followed by a shift to lower.  The sigma adjustment is not
necessary because the two forms are adjacent and will not affect sort
sequences.  The consolidation of dotted and dotless i should not impact
compares in Turkish locales but the Lithuanian removal of U+0307 (combining
dot above) after i will affect the Lithuanian locale.  However, this should
not affect other locales.  It is probably a good idea to do this for all
locales.

Carl