Re: Feedback from C1 Control Pictures Proposal

2011-08-22 Thread Frank da Cruz
This is why the code points should be standardized: the recipient of the email should be able to see the same glyph that the sender saw. And by the same token, debugging techniques can be documented in plain text, with examples. Frank da Cruz http://www.columbia.edu/~fdc/

Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-11 Thread Frank da Cruz
Doug Ewell wrote: ... There was a time, about 10 years ago, when Frank da Cruz would have replied almost immediately about the importance of C1 controls in terminal environments, and the arguments about incompatibility between 8859-1 and Windows-1252 would have been off and running

Re: VISCII (was: Re: [BULK] - Re: MCW encoding of Hebrew)

2004-05-25 Thread Frank da Cruz
And what is KOI-7? A true 7-bit encoding for Russian, in which Cyrillic letters (small and capital respectively) were encoded in the ranges where ASCII has Latin letters (capital and small respectively). The KOI-7 I saw when I was in the USSR in the 1980s was this one:

MIME-aware recode or iconv?

2004-01-15 Thread Frank da Cruz
Is anybody aware of a Unix stdin/stdout application (suitable for piping) that converts a text stream from one character encoding to another based on its MIME headers (as you would find, for example, in an email message)? Applications such as iconv and recode need the source character set
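As a rough illustration of what such a MIME-aware converter would do, here is a minimal Python sketch. It is not a tool mentioned in the thread: the function name `body_as_utf8` and the `latin-1` fallback are assumptions made for the example, and it only handles the plain-text body of a single message.

```python
import email
from email import policy


def body_as_utf8(raw_message: bytes, fallback: str = "latin-1") -> bytes:
    """Return the text body of a MIME message re-encoded as UTF-8.

    The source charset is taken from the Content-Type header; `fallback`
    (an assumption of this sketch) is used when no charset is declared.
    """
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    # Prefer the text/plain part; fall back to the message itself.
    part = msg.get_body(preferencelist=("plain",)) or msg
    charset = part.get_content_charset() or fallback
    # decode=True undoes base64 / quoted-printable transfer encoding.
    payload = part.get_payload(decode=True) or b""
    return payload.decode(charset, errors="replace").encode("utf-8")
```

To use it as a pipe-style filter, one would read `sys.stdin.buffer`, pass the bytes to `body_as_utf8`, and write the result to `sys.stdout.buffer`.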

RE: American English translation of character names

2003-12-18 Thread Frank da Cruz
Yes, I did both cards and punched paper tape as a teenager. I did them too. Nothing to do with Unicode, but those who would like an introduction to punched cards and early computing (mainly IBM oriented) are welcome to take a look at this: http://www.columbia.edu/acis/history/

Re: [OT]

2003-12-09 Thread Frank da Cruz
[EMAIL PROTECTED] wrote: Stout was indeed given as a health drink in small doses in certain cases, it's one of the few foods that are a good source of both iron and calcium. However the only doctor I've heard of recommending it in recent years was... I know of an (Irish) obstetrician in NYC

Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-22 Thread Frank da Cruz
Jonathan Coxhead [EMAIL PROTECTED] wrote: On 22 Oct 2003, at 6:53, John Cowan wrote: Kent Karlsson scripsit: Don't know about LF, CR. I think that should be two line ends. I agree. I don't know any system that uses this sequence. The BBC Micro---well-known to a generation of

Re: Line Separator and Paragraph Separator

2003-10-20 Thread Frank da Cruz
Are the LS and PS characters actually used in real plain-text documents? At some point in the early 1990s, the thinking was that ASCII control characters were included in Unicode only for round-trip compatibility with existing character sets, but their semantics were undefined, and anyway they

Re: DEC-MCS mapping, anyone?

2003-10-11 Thread Frank da Cruz
I've added a DEC MCS table to the character tables at: http://www.columbia.edu/kermit/csettables.html - Frank

Re: Damn'd fools

2003-07-26 Thread Frank da Cruz
United Kingdom of Great Britain as opposed to the present United Kingdom of Great Britain and Northern Ireland. The wishful misnomer United Kingdom of course refers to the union of the erstwhile independent kingdoms of England (including the principality of Wales and various other

Re: Damn'd fools

2003-07-26 Thread Frank da Cruz
Thanks for the corrections -- see I told you :-) Queen Victoria was of course Empress of India, not Emperor. No other British monarch had that title. It's printed on coins of Edward VII: http://hiwaay.net/~hfears/UK/ed7/P_1902.htm and George V:

Re: Damn'd fools

2003-07-25 Thread Frank da Cruz
Changing (and worse, recycling) 3166 Alpha-2 codes puts us in mind of all sorts of database-related disasters, but that's not all. Think of: . Top-level Internet domains. Imagine the possibilities for spoofing during the transition period. . Postal-code country prefixes, which are

Re: French group separators

2003-07-07 Thread Frank da Cruz
At Mon, 7 Jul 2003 17:12:25 +0100, Michael Everson wrote: At 11:49 -0400 2003-07-07, John Cowan wrote: It's a typewriter-based convention, and is suitable for monowidth fonts only. It's a beastly practice held over from the time when it was useful (that is, when typesetters set the type

Re: French group separators

2003-07-07 Thread Frank da Cruz
Unicode already defines with character properties those punctuation marks that terminate sentences. Why would you need to recognize sequences of two spaces as meaning an end of sentence??? This would be wrong for selecting sentences in preformatted plain text, even in English... Because it has

Re: French group separators

2003-07-07 Thread Frank da Cruz
Mon, 7 Jul 2003 19:41:21 +0100 Michael Everson wrote: At 14:27 -0400 2003-07-07, Frank da Cruz wrote: EMACS aside, it's still an interesting question why -- in English at least -- it was customary throughout the 20th century to put two spaces after a period when typing. I expect it must

Re: French group separators

2003-07-07 Thread Frank da Cruz
It is worth noting that what is described here is the default running mode of Emacs for the English locale. There are a lot more modes on Emacs to handle various languages (including programming languages). Of course. But without two spaces you have greater ambiguity, at least in English: In

Re: [ot] anyone know of a good sending accessible emails guideline page?

2003-06-20 Thread Frank da Cruz
does anyone know of a simple, explanatory web page, aimed at not too technical people, about sending *accessible* email and, if really necessary, attachments, and the problems related to attachments (specifically inaccessibility, not viruses). i'm looking for a nice concise web page that i

Missing native-script country names

2003-03-29 Thread Frank da Cruz
For: http://www.columbia.edu/kermit/postal.html#index which is coming along quite nicely, thanks to many in this group... Can anyone supply UTF-8 native-script names of the following countries? Bangladesh Comoros Laos Maldives (if they use a non-Roman script) Mauritania (ditto)

Re: Arabic country names

2003-03-21 Thread Frank da Cruz
Edward H Trager [EMAIL PROTECTED] wrote (about how to find Arabic country names): You need to download IBM's very thorough International Components for Unicode library which is available under an Open Source license at: http://oss.software.ibm.com/icu/download/2.4/index.html ...there is

Arabic country names

2003-03-20 Thread Frank da Cruz
It would seem timely to augment the collection of native-script UTF-8 country names in: http://www.columbia.edu/kermit/postal.html#index with more Arabic ones. So far, Arabic is the most under-represented script. I have a few (Egypt, Iran, Tajikistan) cribbed from Tex's page but would like

Gothic

2003-03-17 Thread Frank da Cruz
I received from Aurélien Coudurier a picture of the "I Can Eat Glass" sentence in Gothic. This was my first adventure with constructing a UTF-8 string for non-BMP characters (I also have a few of these for Vietnamese Nôm but they were sent in by B Phc, a.k.a. James Do). The result is here:
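For readers who have not built a UTF-8 string for non-BMP characters by hand, the following Python sketch shows the four-byte form that every code point above U+FFFF takes. The function name is hypothetical, and U+10330 (the first code point of the Gothic block) is used only as a sample input; this is an illustration, not the method used for the page.

```python
def utf8_encode_supplementary(cp: int) -> bytes:
    """Encode one supplementary-plane code point as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    return bytes([
        0xF0 | (cp >> 18),           # lead byte: 11110xxx
        0x80 | ((cp >> 12) & 0x3F),  # trail byte: 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),   # trail byte: 10xxxxxx
        0x80 | (cp & 0x3F),          # trail byte: 10xxxxxx
    ])


# U+10330 is the first code point in the Gothic block.
gothic_first = utf8_encode_supplementary(0x10330)
print(gothic_first.hex(" "))  # f0 90 8c b0
```

The result can be checked against Python's built-in encoder with `chr(0x10330).encode("utf-8")`.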

Re: geometric shapes

2003-03-13 Thread Frank da Cruz
I've got a few questions about the use of geometric shapes, like squares and such. Some of these look very similar to one another, and I don't know which ones to use in which circumstances! Are there any guidelines on their use? Just as an example, let's look at the squares. These come in

Re: geometric shapes

2003-03-13 Thread Frank da Cruz
Pim Blokland schreef: Frank da Cruz schreef (e.g. VT220) or PC code page (e.g. CP437) can reveal such things. I really was speaking about the geometric shape range (U+25A0 through U+25FF), not about the box drawing characters (U+2500..U+257F) and block elements (U+2580..U+259F), which I

Caron / Hacek?

2003-03-04 Thread Frank da Cruz
I just noticed that upper and/or lower case letters D, I, L, and T with caron (hacek) are sometimes displayed with an apostrophe instead of a caron (and sometimes not). Is there any rhyme or reason to this? - Frank

Character set tables

2003-03-04 Thread Frank da Cruz
Some of you might find these tables useful: http://www.columbia.edu/kermit/csettables.html As time permits, I'll add more. - Frank

Re: BOM's at Beginning of Web Pages?

2003-02-17 Thread Frank da Cruz
On Mon, 17 Feb 2003 08:13:51 -0500 (EST), Jungshik Shin [EMAIL PROTECTED] wrote: Incidentally, it just occurred to me that ftp/ssh clients may offer a user-configurable option for the automatic removal of 'UTF-8 BOM' at the beginning of a text file in UTF-8 when moving files from Windows
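The BOM removal suggested here amounts to dropping three bytes (EF BB BF) from the start of the file. A minimal Python sketch, with a hypothetical function name and no claim about how any particular ftp or ssh client does it:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only a BOM at the very beginning is touched; the same three bytes later in the file are a ZERO WIDTH NO-BREAK SPACE and are left alone.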

Re: Country names in native script

2003-01-24 Thread Frank da Cruz
Frank, feel free to take the country names out of my Unicode example page: http://www.i18nguy.com/unicode-example.html Already in UTF-8 for you. Perfect, thanks -- I borrowed the CJK ones, the Amharic ones, the Arabic and Hebrew ones, Hindustani, Bhutan, Khmer, and a couple Cyrillic names I

Re: Country names in native script

2003-01-21 Thread Frank da Cruz
Fuerstentum Liechtenstein may be also written as Fürstentum Liechtenstein, of course. I'm not sure, but I think Luxembourg should be Lëtzeburg. Thanks, that's correct -- I have that on the glass page already. This new project only came into my head last night so I have added just a few

Country names in native script

2003-01-20 Thread Frank da Cruz
Hi all. In the spirit of I can eat glass, but more usefully, I took a few minutes to convert my international postal addresses page to UTF-8: http://www.columbia.edu/kermit/postal.html and added some Greek and Cyrillic to Appendix II (the table of country names). Anybody who would like to

RE: Small Latin Letter m with Macron

2003-01-16 Thread Frank da Cruz
The convention of using a horizontal line to mark an abbreviation, often the omission of m or n, goes back to the middle ages (if not earlier) and was often used in early printed books; apparently it has lived on in some handwriting, to judge from your post. It was used in English too, see:

RE: The result of the plane 14 tag characters review.

2002-11-18 Thread Frank da Cruz
As a result of being monofont plain text viewers/editors are also notorious for not supporting much beyond a limited repertoire of characters [a few noble exceptions to this rule notwithstanding]. Unless a widely used plain-text protocol requires or supports these characters, they remain

RE: World Address Project starts and relies on Unicode heavily

2002-10-07 Thread Frank da Cruz
Don't forget the ever-popular Frank's Compulsive Guide to Postal Addresses: http://www.kermit-project.org/postal.html Some day when I'm caught up, I'll convert it to UTF-8 and add some text in native scripts. - Frank

Re: Romanized Cyrillic bibliographic data--viable fonts?

2002-08-26 Thread Frank da Cruz
Gory details: ... The specified Romanization for each of these Cyrillic characters includes a ligature over the top of the two Latin code points in question (to indicate that the Latin characters represent a single Cyrillic character presumably). If you can use horizontal bars over the

Re: Double Macrons on gh (was Re: Tildes on Vowels)

2002-08-12 Thread Frank da Cruz
A propos of this long thread about display of combining macrons in Middle English, morphing from tildes on vowels: ... Please note that both the UTC and WG2 have approved a new set of combining double accents: U+035D COMBINING DOUBLE BREVE U+035E COMBINING DOUBLE MACRON U+035F

Re: Tildes on vowels

2002-08-11 Thread Frank da Cruz
Consider the recent example offered by Frank da Cruz, which uses the superscript i. Thus Þe (The) might be written Yⁱ. (If you have au_courant.ttf installed and can actually display it.) In HTML, that might be written as Y<superscript>i</superscript> That's mark-up. As a visual aid

Re: Tildes on vowels

2002-08-11 Thread Frank da Cruz
Frank, which font are you using? Arial Unicode MS has the problems you describe. If you use James Kass CODE2000 you can see them. I know. But with regular Windows fonts installed you don't seem to make out very well. It surprises me that combining macron doesn't combine! In whatever fonts

Re: Tildes on vowels

2002-08-11 Thread Frank da Cruz
The combining macron over the gh isn't complete, it is like a macron over each letter individually. I changed this just now (at James's suggestion) to Combining Overline. Thanks. - Frank

Re: Tildes on vowels

2002-08-11 Thread Frank da Cruz
The page seems to be encoded correctly. MSIE sometimes displays UTF-8 encoded material a bit differently from the same material encoded as NCRs. MSIE has no direct font setting for UTF-8 material, but one trick is to set both the Latin font and the User Defined font to the desired font

Re: Pronunciation of U+0429 (was RE: Digraphs as Distinct Logical Units)

2002-08-08 Thread Frank da Cruz
I will take a walk to the other side of our building and visit a Russian software consulting company (they represent Russian software companies in the US). Let's see how many different opinions I'll get there. ;-) Yes, please! I had four different Russian teachers and one of them was

Re: Q: Filesystem Encoding

2002-07-10 Thread Frank da Cruz
Barry Caplan [EMAIL PROTECTED] wrote: But be aware that such filenames may or may not be able to be transferred *across* file systems. Not only that, but, although I haven't tested in detail for a while, I would not be fully comfortable with middleware that is responsible for managing file

RE: Rotated Glyphs

2002-06-20 Thread Frank da Cruz
Thanks to Jungshik Shin for the solution to the problem and to Marco for his comments; a corrected page reflecting both is up: http://www.columbia.edu/kermit/glass.html (if you looked at it before, you'll need to refresh the images). I also added a bit more about BIDI, using the Hebrew

Re: UTF8 file transfer and interoperability problem

2002-06-07 Thread Frank da Cruz
I'd like to know how file transfer works, for filenames encoded in UTF8, using FTP... So what happens when Windows receive the UTF8 filenames via file transfer from a Linux/Unix machine? I don't know how it works using standard Windows and Linux tools, but I know how it works using the

Need a Japanese book title

2002-04-29 Thread Frank da Cruz
Now that UTF-8 on the Web is no longer such a novelty, I'm starting to encode some more pages that way. For a start, a bibliography of Kermit protocol and software: http://www.columbia.edu/kermit/biblio.html The main benefit at present being a Russian title at the bottom. Item number 7 on

Japanese book title (got it)

2002-04-29 Thread Frank da Cruz
What a group -- I ask and five minutes later I've got it: http://www.columbia.edu/kermit/biblio.html Thanks, Deborah Goldsmith! Speed Kanji -- It's the Speed Accordion of the new millennium :-) - Frank

When was U+xxxx added?

2002-04-11 Thread Frank da Cruz
Given a Unicode encoding value U+xxxx (or whatever for non-BMP), how can I find out the version of the Unicode standard in which this character first appeared? - Frank

Re: Unicode Latin combining diacritics - Looking for real-world example documents

2002-04-02 Thread Frank da Cruz
We're doing some testing of Latin Diacritic support for IPA and African languages, romanizations, etc., and it is (understandably) very hard to find any real text in languages that require this support... Well, so far we have I can eat glass in Yoruba and Twi:

Re: Keyboard mapping on Windows XP?

2002-01-31 Thread Frank da Cruz
Recently I got Windows XP. Now I need to fix the keyboard. On Windows 98 I used to use the great ZDKeyMap utility (a virtual driver available at zdnet.com) to remap several keys on my keyboard. This utility doesn't work with Windows XP. Does anyone out there have a keyboard

Re: Does anyone know if there is a convertor that can convert UTF-8 to Shift-JIS?

2001-07-31 Thread Frank da Cruz
James Kass wrote: Foster Feng wrote: Does anyone know if there is a convertor that can convert UTF-8 to Shift-JIS? Try uniconv.exe by Basis Technology. It is distributed for free as a demo of the Rosette library; download from http://rosette.basistech.com/demo.html It's a big

Re: Genesis v. UDHR?

2001-05-29 Thread Frank da Cruz
Trying to translate an English sentence often causes problems. Does hurt mean 1. Injure 2. Cause pain to 3. Both? I believe the intention of the sentence I can eat glass and it doesn't hurt me is to convey the idea that the speaker is... eccentric, which would characterize someone who

Re: Genesis v. UDHR?

2001-05-27 Thread Frank da Cruz
I can provide you Pashto, Dari (Farsi) and Urdu. do you have specific phrase or should I provide any? It's a silly phrase; I used it because it was already written in many languages -- I converted them to UTF-8 and then added more languages: I can eat glass and it doesn't hurt me.

RE: Genesis v. UDHR?

2001-05-26 Thread Frank da Cruz
As I am interested in finding any texts in Unicode (UTF8) in any language. I must admit that in most cases the more interesting scripts (RL, such as Hebrew, Arabic, Farsi, Urdu, or combining such as Khmer) do not have the source available as UTF8 text, but only as images. In order to test

Re: [OT] bits and bytes

2001-05-18 Thread Frank da Cruz
Now let me ask a slightly different question: Prior to Unicode and ISO 10646, what were the smallest and largest size code units ever used for representing character data? Any characters bigger than 9 bits or smaller than 6? Of course, Baudot was a 5-bit code used widely in Teletype networks,

Re: Support for UTF-8 in ISO-2022/6429 terminals

2001-05-11 Thread Frank da Cruz
DM Now, we added UTF-8 support to the ANSI task following the DM ISO-IR 196 specification. I assume we're talking about some kind of X-based terminal emulator? DM Does anyone know of any examples of host computers or operating DM systems that actually use UTF-8 on an ISO 6429 implementation?

Re: Invalid char display (was: Using hex numbers considered a geek attitude)

2001-04-27 Thread Frank da Cruz
There is a character set missing from Unicode. Unicode needs a special hex display font. Unicode and fonts are two different things. However, I agree it would be nice to have a repertoire of characters whose glyphs are hex values, and proposed this a couple years ago:

[unicode] Re: UCS-2 Files

2001-03-22 Thread Frank da Cruz
On Thu, 22 Mar 2001 15:00:55 -0500, Jeff Guevin [EMAIL PROTECTED] wrote: On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote: Better if you also keep the distinction between "octet" (a series of 8 bits) and "byte" (a series of n bits, where n is often but NOT always 8). When is a byte not

Re: UTF-8, C1 controls, and UNIX

2001-03-02 Thread Frank da Cruz
On Thu, 1 Mar 2001 11:00:45 -0800 (GMT-0800), Frank da Cruz [EMAIL PROTECTED] wrote: This information may be a bit outdated, since it is more than a decade since I worked daily with VMS. VMS is an example of a platform that really, really takes advantage of ISO standards

Re: UTF-8, C1 controls, and UNIX

2001-03-01 Thread Frank da Cruz
On Wed, 28 Feb 2001, Frank da Cruz wrote: [...] Cyrillic letters (e.g. capital A through PE). Most UNIX terminal drivers treat incoming C1 controls like their C0 counterparts, so 0x83 == 0x03 == Ctrl-C, which interrupts whatever process you are talking to. Similarly 0x84 == Ctrl-D
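The collision described here can be made concrete. Assuming the concern is UTF-8 trail bytes that land in the C1 range 0x80-0x9F, the Python sketch below lists, for each character in a string, any such byte together with the C0 control it would become if the driver ignored the high bit. The function name is hypothetical and the high-bit stripping is a simplified model of the behaviour described, not a description of any specific driver.

```python
def c1_collisions(text: str) -> list[tuple[str, int, int]]:
    """List (character, C1 byte, C0 counterpart) for UTF-8 bytes in 0x80-0x9F."""
    hits = []
    for ch in text:
        for b in ch.encode("utf-8"):
            if 0x80 <= b <= 0x9F:
                # Dropping the high bit maps 0x83 to 0x03 (Ctrl-C), and so on.
                hits.append((ch, b, b & 0x7F))
    return hits


for ch, c1, c0 in c1_collisions("\u0403\u0410"):
    print(f"U+{ord(ch):04X}: byte 0x{c1:02X} would act as 0x{c0:02X}")
```

For U+0403 the UTF-8 form is D0 83, so the second byte is the 0x83 == Ctrl-C case from the message; U+0410 (Cyrillic capital A) encodes as D0 90.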

Re: UTF-8, C1 controls, and UNIX

2001-03-01 Thread Frank da Cruz
I don't understand this part of your rhetoric here. In UTF-8, *ASCII* is sacrosanct, not just "/". Right, sorry. I withdraw my point about VMS and other pathnames. And as for your overall point, I don't know of any claim that UTF-8 was designed for "transparent usability with hosts that

UTF-8, C1 controls, and UNIX

2001-02-28 Thread Frank da Cruz
The idea behind UTF-8 is to be able to use it in non-Unicode-aware UNIX versions: It lets you have Unicode filenames, Unicode directory names, Unicode file contents, Unicode email, etc. But what it does not do is let you *type* Unicode into regular UNIX applications or shells, if the UTF-8

Re: UTF-8, C1 controls, and UNIX

2001-02-28 Thread Frank da Cruz
Maybe one should make a transmission safe UTF that left C1 alone? Remember this? -- From: Markus Scherer [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Date: Mon, 10 Apr 2000 15:23:53 -0800 (GMT-0800) Subject: What if UTF-8 had been defined after UTF-16? What if UTF-8 had been

Re: Latin digraph characters (was: Re: Klingon silliness)

2001-02-27 Thread Frank da Cruz
Oops, sorry, don't bother to tell me, it starts with an X, not a K. - Frank

Re: [OT] What is DEL for?

2001-02-21 Thread Frank da Cruz
Which systems interpret 0x7F as "interrupt process"? I know that this would be 0x03 in DOS (^C), and 0x03, 0x04 or 0x1A in Unix (^C, ^D, and ^Z, respectively), but I know nothing about other systems, e.g. Macintosh. Very long ago, in the Seventh Edition of Unix, the default interrupt

Re: Unicode-aware FTP client

2001-01-08 Thread Frank da Cruz
Could you kindly explain what "Unicode-aware FTP client" means? As I understand, the original FTP specification does not transfer any charset information. How is your FTP client aware of Unicode? Do you mean you implement RFC 2640? No, I mean the client controls everything,

Re: Unicode-aware FTP client

2001-01-08 Thread Frank da Cruz
Hum... interesting. What would you suggest we (Mozilla) do to enhance our FTP browser to support something similar? Spend 20 years doing the research and writing the code, like I did? :-) Seriously, let's continue this offline. - Frank

Unicode-aware FTP client

2001-01-07 Thread Frank da Cruz
I posted a message here about a month ago about C-Kermit 7.1, which now includes a Unicode-aware FTP client. The second Alpha test has just been announced: http://www.columbia.edu/kermit/ck71.html The first Alpha test converted character sets of text files, but did not do anything about

Unicode-aware ftp client

2000-12-12 Thread Frank da Cruz
Hi folk. The Kermit Project at Columbia University (a Unicode Consortium member) is happy to announce a Unicode-aware FTP client for UNIX (potentially all varieties: Linux, AIX, Solaris, etc etc), available now for testing: http://www.columbia.edu/kermit/ftpclient.html In fact, it's a new

Re: TXT file that displays actual Unicode characters only

2000-11-13 Thread Frank da Cruz
In looking at the Unicode Consortium site, I see a variety of txt and html files that give a description of characters in different unicode blocks, but I have not yet found a text, doc, or html file that simply contains the actual unicode characters, either in the standard's entirety, or by

Re: Is this in Unicode?

2000-10-12 Thread Frank da Cruz
At 03:09 AM 10/12/2000 -0800, Michael Everson wrote: Well, John, it might be helpful if I could see the other characters in the font, as this might put the character in context. Having said that, I don't recognize this particular one, but it reminds me of a symbol which can be used to

Re: Correct definition for an isLatin1() function

2000-10-05 Thread Frank da Cruz
"Rogers, Paul" wrote: We're whipping up a little function named isLatin1() that returns true if the (UCS-2) string in question is "all Latin1". [snip] In other words, should we exclude the C0, C1, and Latin Extended code values? Including or excluding C0 and C1 is a matter of
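Since the answer turns on whether controls count, a sketch of such a function can simply make that a parameter. The Python below is an illustration only: the name `is_latin1`, the `allow_controls` flag, and the decision to treat DEL (0x7F) as a control alongside C0 and C1 are assumptions of this example, not the poster's definition.

```python
def is_latin1(text: str, allow_controls: bool = True) -> bool:
    """True if every code point is in U+0000..U+00FF.

    With allow_controls=False, C0 (0x00-0x1F), DEL (0x7F) and
    C1 (0x80-0x9F) are rejected as well.
    """
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            return False
        if not allow_controls and (cp < 0x20 or 0x7F <= cp <= 0x9F):
            return False
    return True
```

With the default, `is_latin1("na\u00efve\n")` is true; with `allow_controls=False` the newline makes it false, which is why a real implementation would probably whitelist a few format controls.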

Re: OT: Correct definition for an isLatin1() function

2000-10-05 Thread Frank da Cruz
Michael Kaplan RANTed: The assumption here is that the function will be run on Unicode text. Therefore, the various industrial and other code pages are irrelevant. Microsoft does not convert the characters it has in the control code range to those same code points in Unicode, does it? Indeed,

FTP and UTF-8

2000-09-24 Thread Frank da Cruz
Does anybody know of a publicly accessible FTP server that supports RFCs 2389 (negotiation of new features) and 2640 (internationalization)? Preferably one that allows anonymous uploads (for testing purposes)? In case you're not aware of these RFCs, they provide for UTF-8 based FTP. Thanks! -

Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-13 Thread Frank da Cruz
Erik van der Poel wrote: Frank da Cruz wrote: The irony is, when using ISO 2022 character-set designation and invocation, you have to handle the escape sequences first to know if you're in UTF-8. Therefore, this pushes the burden onto the end-user to preconfigure their emulator for UTF-8

Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-12 Thread Frank da Cruz
Frank da Cruz [EMAIL PROTECTED] wrote: . If you send a code in the 0x80-0x9f range to such a terminal or emulator, it properly treats it as a control code. If it was intended as a graphic character ("smart quote" or somesuch) the result is a fractured screen, some
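To see which bytes in a message would trigger this, one can list every byte in the 0x80-0x9F range alongside the graphic character the sender probably meant. The Python sketch below assumes the sender's text was Windows-1252, a common source of "smart quotes" in that range; the function name is hypothetical.

```python
def describe_c1_bytes(data: bytes) -> list[tuple[int, str]]:
    """Pair each byte in 0x80-0x9F with its Windows-1252 interpretation."""
    out = []
    for b in data:
        if 0x80 <= b <= 0x9F:
            # Bytes left undefined in Windows-1252 come back as U+FFFD.
            as_1252 = bytes([b]).decode("cp1252", errors="replace")
            out.append((b, as_1252))
    return out


for byte, char in describe_c1_bytes(b"\x93quoted\x94"):
    print(f"0x{byte:02X}: C1 control in ISO 8859-1, U+{ord(char):04X} in Windows-1252")
```

A terminal that follows ISO 6429 sees 0x93 and 0x94 as control codes, while the sender saw left and right double quotation marks.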

RE: Proposal to make the unicode list more transparent!

2000-07-12 Thread Frank da Cruz
This is, I think, a good idea. If we informally agreed to a syntax, like "use square brackets for the topic", then people could filter for things like "[CJK]". This might sound silly, but some people still use ISO 646-based displays, in which square brackets show up as umlauts, etc.

Re: Euro character in ISO

2000-07-12 Thread Frank da Cruz
On Wed, 12 Jul 2000 10:43:59 -0800, Robert A. Rosenberg wrote: At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote: On Tue, 11 Jul 2000, Robert A. Rosenberg wrote: At 15:30 -0800 on 07/11/00, Asmus Freytag wrote: There has been an attempt to create a series of 'touched up' 8859

Re: Euro character in ISO

2000-07-12 Thread Frank da Cruz
On Wed, 12 Jul 2000, Frank da Cruz wrote: Perhaps you're suggesting the Unix 'mail' should become a translation agent between the character set of the mail and that of the user's terminal? I hope not, since given that practically any character set anybody can dream up is "