. This is why the code points should be standardized: the recipient
of the email should be able to see the same glyph that the sender saw. And
by the same token, debugging techniques can be documented in plain text,
with examples.
Frank da Cruz
http://www.columbia.edu/~fdc/
Doug Ewell wrote:
...
There was a time, about 10 years ago, when Frank da Cruz would have
replied almost immediately about the importance of C1 controls in
terminal environments, and the arguments about incompatibility between
8859-1 and Windows-1252 would have been off and running
And what is KOI-7?
A true 7-bit encoding for Russian, in which Cyrillic letters (small and
capital respectively) were encoded in the ranges where ASCII has Latin
letters (capital and small respectively).
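A small sketch of the layout just described. It does not use a real KOI-7 table: it assumes, as an approximation, that clearing the high bit of KOI8-R bytes puts each Cyrillic letter on the 7-bit position in question. The function name is made up for illustration, and individual KOI-7 variants may differ in detail.

```python
def koi8r_to_7bit(text: str) -> str:
    """Encode text as KOI8-R, then clear bit 7 of every byte.

    Assumption: this approximates a KOI-7 style layout; it is not
    taken from an official KOI-7 code table.
    """
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")


if __name__ == "__main__":
    # Small Cyrillic letters land where ASCII has capital Latin letters,
    # and capital Cyrillic letters land where ASCII has small ones.
    print(koi8r_to_7bit("привет"))
    print(koi8r_to_7bit("ПРИВЕТ"))
```

Read on an ASCII-only display, the result looks like a case-inverted Latin transliteration of the Russian text.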
The KOI-7 I saw when I was in the USSR in the 1980s was this one:
Is anybody aware of a Unix stdin/stdout application (suitable for
piping) that converts a text stream from one character encoding to
another based on its MIME headers (as you would find, for example, in
an email message)? Applications such as iconv and recode need the
source character set
Yes, I did both cards and punched paper tape as a teenager.
I did them too. Nothing to do with Unicode, but those who would like an
introduction to punched cards and early computing (mainly IBM oriented)
are welcome to take a look at this:
http://www.columbia.edu/acis/history/
[EMAIL PROTECTED] wrote:
Stout was indeed given as a health drink in small doses in certain
cases, it's one of the few foods that are a good source of both iron and
calcium. However the only doctor I've heard of recommending it in recent
years was...
I know of an (Irish) obstetrician in NYC
Jonathan Coxhead [EMAIL PROTECTED] wrote:
On 22 Oct 2003, at 6:53, John Cowan wrote:
Kent Karlsson scripsit:
Don't know about LF, CR. I think that should be two line ends.
I agree. I don't know any system that uses this sequence.
The BBC Micro---well-known to a generation of
Are the LS and PS characters actually used in real plain-text documents?
At some point in the early 1990s, the thinking was that ASCII control
characters were included in Unicode only for round-trip compatibility
with existing character sets, but their semantics were undefined, and anyway
they
I've added a DEC MCS table to the character tables at:
http://www.columbia.edu/kermit/csettables.html
- Frank
United Kingdom of Great Britain as opposed to the present United Kingdom
of Great Britain and Northern Ireland. The wishful misnomer United
Kingdom of course refers to the union of the erstwhile independent kingdoms
of England (including the principality of Wales and various other
Thanks for the corrections -- see I told you :-)
Queen Victoria was of course Empress of India, not Emperor. No other
British monarch had that title.
It's printed on coins of Edward VII:
http://hiwaay.net/~hfears/UK/ed7/P_1902.htm
and George V:
Changing (and worse, recycling) 3166 Alpha-2 codes puts us in mind of
all sorts of database-related disasters, but that's not all. Think of:
. Top-level Internet domains. Imagine the possibilities for spoofing
during the transition period.
. Postal-code country prefixes, which are
At Mon, 7 Jul 2003 17:12:25 +0100, Michael Everson wrote:
At 11:49 -0400 2003-07-07, John Cowan wrote:
It's a typewriter-based convention, and is suitable for monowidth
fonts only.
It's a beastly practice held over from the time when it was useful
(that is, when typesetters set the type
Unicode already defines, with character properties, the punctuation marks that
terminate sentences. Why would you need to recognize sequences of two spaces
as meaning an end of sentence? This would be the wrong way to select sentences in
preformatted plain text, even in English...
Because it has
Mon, 7 Jul 2003 19:41:21 +0100 Michael Everson wrote:
At 14:27 -0400 2003-07-07, Frank da Cruz wrote:
EMACS aside, it's still an interesting question why -- in English at
least -- it was customary throughout the 20th century to put two spaces
after a period when typing. I expect it must
It is worth noting that what is described here is the default running mode of
Emacs for the English locale. There are a lot more modes on Emacs to
handle various languages (including programming languages).
Of course. But without two spaces you have greater ambiguity, at least in
English: In
does anyone know of a simple, explanatory web page, aimed at not too
technical people, based on sending *accessible* email, and if really
necessary attachments and the problems related to attachments
(specifically inaccessibility, not viruses).
i'm looking for a nice concise web page that i
For:
http://www.columbia.edu/kermit/postal.html#index
which is coming along quite nicely, thanks to many in this group...
Can anyone supply UTF-8 native-script names of the following countries?
Bangladesh
Comoros
Laos
Maldives (if they use a non-Roman script)
Mauritania (ditto)
Edward H Trager [EMAIL PROTECTED] wrote (about how to find Arabic
country names):
You need to download IBM's very thorough International Components for
Unicode library which is available under an Open Source license at:
http://oss.software.ibm.com/icu/download/2.4/index.html
...there is
It would seem timely to augment the collection of native-script
UTF-8 country names in:
http://www.columbia.edu/kermit/postal.html#index
with more Arabic ones. So far, Arabic is the most under-represented
script. I have a few (Egypt, Iran, Tajikistan) cribbed from Tex's page
but would like
I received from Aurélien Coudurier a picture of the
"I Can Eat Glass" sentence in Gothic. This was my first adventure
with constructing a UTF-8 string for non-BMP characters (I also
have a few of these for Vietnamese Nôm but they were sent in
by Bá Phước, a.k.a. James Do). The result is here:
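For readers who have not built such a string by hand, here is a minimal sketch of the four-byte UTF-8 form used for non-BMP code points. The function name is hypothetical, and U+10330 (GOTHIC LETTER AHSA) is chosen only as a sample; this is not the procedure the author used.

```python
def utf8_encode_supplementary(cp: int) -> bytes:
    """Encode one code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    return bytes([
        0xF0 | (cp >> 18),           # lead byte: 11110xxx
        0x80 | ((cp >> 12) & 0x3F),  # continuation: 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),   # continuation: 10xxxxxx
        0x80 | (cp & 0x3F),          # continuation: 10xxxxxx
    ])


if __name__ == "__main__":
    encoded = utf8_encode_supplementary(0x10330)
    print(encoded.hex(" "))
    # Cross-check against Python's built-in encoder.
    print(encoded == chr(0x10330).encode("utf-8"))
```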
I've got a few questions about the use of geometric shapes, like
squares and such.
Some of these look very similar to one another, and I don't know
which ones to use in which circumstances!
Are there any guidelines on their use?
Just as an example, let's look at the squares. These come in
Pim Blokland schreef:
Frank da Cruz schreef
(e.g. VT220) or PC code page (e.g. CP437) can reveal such things.
I really was speaking about the geometric shape range (U+25A0
through U+25FF), not about the box drawing characters
(U+2500..U+257F) and block elements (U+2580..U+259F), which I
I just noticed that upper and/or lower case letters D, I, L, and T
with caron (hacek) are sometimes displayed with an apostrophe instead
of a caron (and sometimes not). Is there any rhyme or reason to
this?
- Frank
Some of you might find these tables useful:
http://www.columbia.edu/kermit/csettables.html
As time permits, I'll add more.
- Frank
On Mon, 17 Feb 2003 08:13:51 -0500 (EST),
Jungshik Shin [EMAIL PROTECTED] wrote:
Incidentally, it just occurred to me that ftp/ssh clients may offer a
user-configurable option for the automatic removal of 'UTF-8 BOM' at
the beginning of a text file in UTF-8 when moving files from Windows
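The kind of option suggested here could be as small as the following sketch. The function name is hypothetical and no particular ftp or ssh client is implied; it only shows removal of the three-byte UTF-8 signature (EF BB BF) when it is the first thing in the data.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    print(strip_utf8_bom(b"\xef\xbb\xbfhello\n"))
    print(strip_utf8_bom(b"hello\n"))
```

Only the leading occurrence is removed; the same byte sequence later in the file is left alone.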
Frank, feel free to take the country names out of my Unicode example
page:
http://www.i18nguy.com/unicode-example.html
Already in UTF-8 for you.
Perfect, thanks -- I borrowed the CJK ones, the Amharic ones, the Arabic
and Hebrew ones, Hindustani, Bhutan, Khmer, and a couple Cyrillic names I
Fuerstentum Liechtenstein may be also written as Fürstentum
Liechtenstein, of course. I'm not sure, but I think Luxembourg should be
Lëtzeburg.
Thanks, that's correct -- I have that on the glass page already. This
new project only came into my head last night so I have added just a few
Hi all. In the spirit of I can eat glass, but more usefully, I took a few
minutes to convert my international postal addresses page to UTF-8:
http://www.columbia.edu/kermit/postal.html
and added some Greek and Cyrillic to Appendix II (the table of country
names). Anybody who would like to
The convention of using a horizontal line to mark an abbreviation, often
the omission of m or n, goes back to the middle ages (if not earlier)
and was often used in early printed books; apparently it has lived on in
some handwriting, to judge from your post.
It was used in English too, see:
As a result of being monofont, plain text viewers/editors are also notorious
for not supporting much beyond a limited repertoire of characters [a few
noble exceptions to this rule notwithstanding].
Unless a widely used plain-text protocol requires or supports these
characters, they remain
Don't forget the ever-popular Frank's Compulsive Guide to Postal
Addresses: http://www.kermit-project.org/postal.html
Some day when I'm caught up, I'll convert it to UTF-8 and add some
text in native scripts.
- Frank
Gory details:
...
The specified Romanization for each of these Cyrillic characters
includes a ligature over the top of the two Latin code points in
question (to indicate that the Latin characters represent a single
Cyrillic character presumably).
If you can use horizontal bars over the
A propos of this long thread about display of combining macrons in
Middle English, morphing from tildes on vowels:
...
Please note that both the UTC and WG2 have approved a new set
of combining double accents:
U+035D COMBINING DOUBLE BREVE
U+035E COMBINING DOUBLE MACRON
U+035F
Consider the recent example offered by Frank da Cruz,
which uses the superscript i.
Thus Þe (The) might be written Yⁱ.
(If you have au_courant.ttf installed and can actually display it.)
In HTML, that might be written as Y<superscript>i</superscript>
That's mark-up.
As a visual aid
Frank, which font are you using?
Arial Unicode MS has the problems you describe. If you use James Kass's
CODE2000 you can see them.
I know. But with regular Windows fonts installed you don't seem to make
out very well. It surprises me that combining macron doesn't combine!
In whatever fonts
The combining macron over the gh isn't complete, it is like a macron
over each letter individually.
I changed this just now (at James's suggestion) to Combining Overline.
Thanks.
- Frank
The page seems to be encoded correctly.
MSIE sometimes displays UTF-8 encoded material a bit differently
from the same material encoded as NCRs.
MSIE has no direct font setting for UTF-8 material, but one trick
is to set both the Latin font and the User Defined font to
the desired font
I will take a walk to the other side of our building and visit a
Russian software consulting company (they represent Russian software
companies in the US). Let's see how many different opinions I'll get
there. ;-)
Yes, please! I had four different Russian teachers and one of them
was
Barry Caplan [EMAIL PROTECTED] wrote:
But be aware that such filenames may or may not be able to be transferred
*across* file systems. Not only that, but, although I haven't tested in
detail for a while, I would not be fully comfortable with middleware that is
responsible for managing file
Thanks to Jungshik Shin for the solution to the problem and to Marco for
his comments; a corrected page reflecting both is up:
http://www.columbia.edu/kermit/glass.html
(if you looked at it before, you'll need to refresh the images). I also
added a bit more about BIDI, using the Hebrew
I'd like to know how file transfer works, for filenames encoded in UTF8,
using FTP... So what happens when Windows receives the UTF8
filenames via file transfer from a Linux/Unix machine?
I don't know how it works using standard Windows and Linux tools, but I know
how it works using the
Now that UTF-8 on the Web is no longer such a novelty, I'm starting to
encode some more pages that way. For a start, a bibliography of Kermit
protocol and software:
http://www.columbia.edu/kermit/biblio.html
The main benefit at present being a Russian title at the bottom.
Item number 7 on
What a group -- I ask and five minutes later I've got it:
http://www.columbia.edu/kermit/biblio.html
Thanks, Deborah Goldsmith! Speed Kanji -- It's the
Speed Accordion of the new millennium :-)
- Frank
Given a Unicode encoding value U+ (or whatever for non-BMP), how can
I find out the version of the Unicode standard in which this character
first appeared?
- Frank
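One possible approach, offered as a sketch: the Unicode Character Database today includes a file named DerivedAge.txt that lists code point ranges with the version in which they were first assigned. The code below parses lines in that layout. The sample text is hand-written for illustration and is not the real file, and the function names are hypothetical.

```python
# Hand-written lines in the DerivedAge.txt layout (assumption: illustrative
# excerpt only, not copied from the actual data file).
SAMPLE = """\
# range or single code point ; version
0041          ; 1.1
20AC          ; 2.1
10330..1034A  ; 3.1
"""


def parse_age_lines(text):
    """Turn 'XXXX..YYYY ; V.V' lines into (first, last, version) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        span, version = (part.strip() for part in line.split(";"))
        first, _, last = span.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), version))
    return ranges


def first_version(cp, ranges):
    """Return the version string for code point cp, or None if not listed."""
    for lo, hi, version in ranges:
        if lo <= cp <= hi:
            return version
    return None


if __name__ == "__main__":
    table = parse_age_lines(SAMPLE)
    print(first_version(0x20AC, table))
    print(first_version(0x10330, table))
```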
We're doing some testing of Latin Diacritic support for IPA and African
languages, romanizations, etc., and it is (understandably) very hard to
find any real text in languages that require this support...
Well, so far we have I can eat glass in Yoruba and Twi:
Recently I got Windows XP. Now I need to fix the keyboard.
On Windows 98 I used to use the great ZDKeyMap utility (a virtual driver
available at zdnet.com) to remap several keys on my keyboard. This utility
doesn't work with Windows XP.
Does anyone out there have a keyboard
James Kass wrote:
Foster Feng wrote:
Does anyone know if there is a converter that can
convert UTF-8 to Shift-JIS?
Try uniconv.exe by Basis Technology.
It is distributed for free as a demo of the Rosette
library; download from
http://rosette.basistech.com/demo.html
It's a big
Trying to translate an English sentence often causes problems.
Does hurt mean
1. Injure
2. Cause pain to
3. Both?
I believe the intention of the sentence I can eat glass and it doesn't
hurt me is to convey the idea that the speaker is... eccentric, which
would characterize someone who
I can provide you Pashto, Dari (Farsi) and Urdu.
Do you have a specific phrase or should I provide
any?
It's a silly phrase; I used it because it was already
written in many languages -- I converted them to UTF-8
and then added more languages:
I can eat glass and it doesn't hurt me.
As I am interested in finding any texts in Unicode (UTF8) in any language.
I must admit that in most cases the more interesting scripts (RL, such as
Hebrew, Arabic, Farsi, Urdu, or combining such as Khmer) do not have the
source available as UTF8 text, but only as images. In order to test
Now let me ask a slightly different question: Prior to Unicode and ISO
10646, what were the smallest and largest size code units ever used for
representing character data?
Any characters bigger than 9 bits or smaller than 6?
Of course, Baudot was a 5-bit code used widely in Teletype networks,
DM Now, we added UTF-8 support to the ANSI task following the
DM ISO-IR 196 specification.
I assume we're talking about some kind of X-based terminal emulator?
DM Does anyone know of any examples of host computers or operating
DM systems that actually use UTF-8 on an ISO 6429 implementation?
There is a character set missing from Unicode. Unicode needs a special hex
display font.
Unicode and fonts are two different things. However, I agree it would be nice
to have a repertoire of characters whose glyphs are hex values, and proposed
this a couple years ago:
On Thu, 22 Mar 2001 15:00:55 -0500, Jeff Guevin [EMAIL PROTECTED] wrote:
On Thu, 22 Mar 2001, [EMAIL PROTECTED] wrote:
Better if you also keep the distinction between "octet" (a series of
8 bits) and "byte" (a series of n bits, where n is often but NOT
always 8).
When is a byte not
On Thu, 1 Mar 2001 11:00:45 -0800 (GMT-0800), Frank da Cruz
[EMAIL PROTECTED] wrote:
This information may be a bit outdated, since it is more than a decade
since I worked daily with VMS.
VMS is an example of a platform that really, really takes advantage of
ISO standards
On Wed, 28 Feb 2001, Frank da Cruz wrote:
[...]
Cyrillic letters (e.g. capital A through PE). Most UNIX terminal drivers
treat incoming C1 controls like their C0 counterparts, so 0x83 == 0x03 ==
Ctrl-C, which interrupts whatever process you are talking to. Similarly
0x84 == Ctrl-D
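The effect can be modelled in a few lines. This is only a sketch: the quoted message does not name the code page, so CP866 is assumed here because its capitals А through П occupy 0x80 through 0x8F, and the 7-bit masking is a stand-in for what a terminal driver does, not real driver code.

```python
def as_driver_sees_it(byte: int) -> int:
    """Model a driver that treats a C1 byte like its C0 counterpart."""
    return byte & 0x7F


if __name__ == "__main__":
    # Assumption: CP866, where Г is 0x83 and Д is 0x84.
    for letter in "ГД":
        (raw,) = letter.encode("cp866")
        seen = as_driver_sees_it(raw)
        # chr(seen + 0x40) gives the caret-notation letter, e.g. 0x03 -> C.
        print(f"{letter}: 0x{raw:02X} -> 0x{seen:02X} (Ctrl-{chr(seen + 0x40)})")
```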
I don't understand this part of your rhetoric here. In UTF-8, *ASCII* is
sacrosanct, not just "/".
Right, sorry. I withdraw my point about VMS and other pathnames.
And as for your overall point, I don't know of any claim that UTF-8
was designed for "transparent usability with hosts that
The idea behind UTF-8 is to be able to use it in non-Unicode-aware UNIX
versions: It lets you have Unicode filenames, Unicode directory names,
Unicode file contents, Unicode email, etc. But what it does not do is let
you *type* Unicode into regular UNIX applications or shells, if the UTF-8
Maybe one should make a transmission safe UTF that left C1 alone?
Remember this? --
From: Markus Scherer [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Date: Mon, 10 Apr 2000 15:23:53 -0800 (GMT-0800)
Subject: What if UTF-8 had been defined after UTF-16?
What if UTF-8 had been
Oops, sorry, don't bother to tell me, it starts with an X, not a K.
- Frank
Which systems interpret 0x7F as "interrupt process"? I know that this would
be 0x03 in DOS (^C), and 0x03, 0x04 or 0x1A in Unix (^C, ^D, and ^Z,
respectively), but I know nothing about other systems, e.g. Macintosh.
Very long ago, in the Seventh Edition of Unix, the default interrupt
Could you kindly explain what "Unicode-aware FTP client" means?
As I understand it, the original FTP specification does not transfer any
charset information. How is your FTP client aware of Unicode? Do you
mean you implement RFC 2640?
No, I mean the client controls everything,
Hum... interesting. What would you suggest we (Mozilla) do to enhance
our FTP browser to support something similar?
Spend 20 years doing the research and writing the code, like I did? :-)
Seriously, let's continue this offline.
- Frank
I posted a message here about a month ago about C-Kermit 7.1, which
now includes a Unicode-aware FTP client. The second Alpha test has
just been announced:
http://www.columbia.edu/kermit/ck71.html
The first Alpha test converted character sets of text files, but did
not do anything about
Hi folk. The Kermit Project at Columbia University (a Unicode
Consortium member) is happy to announce a Unicode-aware FTP client
for UNIX (potentially all varieties: Linux, AIX, Solaris, etc etc),
available now for testing:
http://www.columbia.edu/kermit/ftpclient.html
In fact, it's a new
In looking at the Unicode Consortium site, I see a variety of txt and html
files that give a description of characters in different unicode blocks, but
I have not yet found a text, doc, or html file that simply contains the
actual unicode characters, either in the standard's entirety, or by
At 03:09 AM 10/12/2000 -0800, Michael Everson wrote:
Well, John, it might be helpful if I could see the other characters in the
font, as this might put the character in context. Having said that, I don't
recognize this particular one, but it reminds me of a symbol which can be
used to
"Rogers, Paul" wrote:
We're whipping up a little function named isLatin1() that returns true if
the (UCS-2) string in question is "all Latin1".
[snip]
In other words, should we exclude the C0, C1, and Latin Extended code
values?
Including or excluding C0 and C1 is a matter of
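A sketch of such a check, with the control-code question left as a caller's choice. The name is_latin1 comes from the quoted message; the allow_controls parameter is an assumption added here to make the C0/C1 decision explicit.

```python
def is_latin1(s: str, allow_controls: bool = True) -> bool:
    """Return True if every character of s is in U+0000..U+00FF.

    With allow_controls=False, the C0 range (U+0000..U+001F), DEL (U+007F)
    and the C1 range (U+0080..U+009F) are rejected as well.
    """
    for ch in s:
        cp = ord(ch)
        if cp > 0xFF:
            return False
        if not allow_controls and (cp < 0x20 or 0x7F <= cp <= 0x9F):
            return False
    return True


if __name__ == "__main__":
    print(is_latin1("café"))
    print(is_latin1("Łódź"))
    print(is_latin1("a\x85b", allow_controls=False))
```

Latin Extended letters such as Ł (U+0141) fall outside U+00FF and are rejected in either mode.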
Michael Kaplan RANTed:
The assumption here is that the function will be run on Unicode text.
Therefore, the various industrial and other code pages are irrelevant.
Microsoft does not convert the characters it has in the control code range
to those same code points in Unicode, does it? Indeed,
Does anybody know of a publicly accessible FTP server that supports
RFCs 2389 (negotiation of new features) and 2640 (internationalization)?
Preferably one that allows anonymous uploads (for testing purposes)?
In case you're not aware of these RFCs, they provide for UTF-8 based FTP.
Thanks!
-
Erik van der Poel wrote:
Frank da Cruz wrote:
The irony is, when using ISO 2022 character-set designation and invocation,
you have to handle the escape sequences first to know if you're in UTF-8.
Therefore, this pushes the burden onto the end-user to preconfigure their
emulator for UTF-8
Frank da Cruz [EMAIL PROTECTED] wrote:
. If you send a code in the 0x80-0x9f range to such a terminal or
emulator, it properly treats it as a control code. If it was
intended as a graphic character ("smart quote" or somesuch) the
result is a fractured screen, some
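The two readings of such a byte can be shown side by side. This sketch assumes the "smart quote" was sent as Windows-1252 byte 0x93, which is one common case and not necessarily the one the author had in mind.

```python
import unicodedata

if __name__ == "__main__":
    byte = b"\x93"

    # Read as ISO 8859-1, the byte is a C1 control character.
    as_latin1 = byte.decode("latin-1")
    print(f"U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))

    # Read as Windows-1252, the same byte is a left double quotation mark.
    as_cp1252 = byte.decode("cp1252")
    print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
```

A terminal that follows the first reading has no glyph to draw and acts on the control code instead.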
This is, I think, a good idea. If we informally agreed to a syntax, like
"use square brackets for the topic", then people could filter for things
like "[CJK]".
This might sound silly, but some people still use ISO 646-based displays,
in which square brackets show up as umlauts, etc.
On Wed, 12 Jul 2000 10:43:59 -0800, Robert A. Rosenberg wrote:
At 08:56 PM 07/11/2000 -0800, Geoffrey Waigh wrote:
On Tue, 11 Jul 2000, Robert A. Rosenberg wrote:
At 15:30 -0800 on 07/11/00, Asmus Freytag wrote:
There has been an attempt to create a series of 'touched up' 8859
On Wed, 12 Jul 2000, Frank da Cruz wrote:
Perhaps you're suggesting the Unix 'mail' should become a translation
agent between the character set of the mail and that of the user's
terminal? I hope not, since given that practically any character set
anybody can dream up is "