Re: Aleph-umlaut

2018-11-11 Thread Otto Stolz via Unicode

Am 2018-11-09 um 13:42 schrieb Mark E. Shoulson via Unicode:


Noticed something really fascinating in an old pamphlet I was reading


Really interesting, thanks!



(Link is 
http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874 
look at the last few pages.)



To me, this link delivers an empty document. Please check the spelling 
of the URL.



Best wishes,

   Otto






Re: Hyphenation Markup

2018-06-02 Thread Otto Stolz via Unicode

Am 2018-06-02 um 06:44 schrieb Richard Wordingham via Unicode:

In Latin text, one can indicate permissible line break opportunities
between grapheme clusters by inserting U+00AD SOFT HYPHEN.  What
low-end schemes, if any, exist for such mark-up within grapheme
clusters?


What about U+200B ZWSP?


 this character is intended for invisible word
separation and for line break control; it has no
width, but its presence between two characters
does not prevent increased letter spacing in
justification
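
As a rough illustration of ZWSP as a break-control character, the following minimal Python sketch fills lines greedily and breaks only where U+200B occurs. The function name `wrap_at_zwsp` and the sample string are assumptions made for this example; real line breaking follows UAX #14 and is considerably more involved.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible break opportunity
SHY = "\u00ad"   # SOFT HYPHEN: break opportunity that shows a hyphen when taken


def wrap_at_zwsp(text: str, width: int) -> list[str]:
    """Greedy line filling; a break is allowed only where a ZWSP occurs."""
    lines, current = [], ""
    for chunk in text.split(ZWSP):
        if current and len(current) + len(chunk) > width:
            lines.append(current)
            current = chunk
        else:
            current += chunk
    if current:
        lines.append(current)
    return lines


# Hypothetical sample: four chunks joined by invisible break opportunities.
marked = ZWSP.join(["line", "break", "control", "demo"])
print(wrap_at_zwsp(marked, 10))
print(unicodedata.name(ZWSP), unicodedata.category(ZWSP))
print(unicodedata.name(SHY))
```

The ZWSP itself is dropped at the break and leaves no visible trace, which is the difference from the soft hyphen.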


Best wishes,
  Otto Stolz


Re: Uppercase ß

2018-05-29 Thread Otto Stolz via Unicode

Hello,
am 2018-05-29 um 10:15 Uhr hat Hans Åberg geschrieben:

Duden used one in 1957, but stated in 1984 that there is no uppercase version 
[1].


There used to be two different orthographic dictionaries,
both called “Duden”:
► The Duden from Leipzig (GDR) had a capital “ß” on the cover page
  of its 1957 edition.
► The Duden from Mannheim (FRG) has never featured a capital “ß”, IIRC.


So it would be interesting with a reference to an official version.


Neither Duden has been anything like an “official version” – never ever.
Until 1996, the only official German orthography was somewhat loosely
defined by a common decision of the Ministers of Education of the FRG,
with an additional remark saying: “In case of doubt, the spelling of the
latest edition of the Duden (i. e. the Mannheim version) will take
effect.”

Nowadays, the official version of the orthographic rules
can be found in:
<http://www.rechtschreibrat.com/regeln-und-woerterverzeichnis/>;
the uppercase-ß rule, in particular, is discussed in
<http://www.rechtschreibrat.com/DOX/rfdr_Regeln_2016_redigiert_2018.pdf>,
under §25(E3); the latest version of the rule reads as follows:

E3: Bei Schreibung mit Großbuchstaben schreibt man SS.
Daneben ist auch die Verwendung des Großbuchstabens ẞ 
möglich. 

which means:
  When writing in all-caps, you write SS.
  Alternatively, the capital ẞ may be used.

So, the normal upper-case equivalent of German sharp-S
still is the double-S. The recently introduced capital sharp-S
is an optional alternative, but not the normal way of
uppercasing the sharp-S.
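
The default Unicode case mappings reflect this preference. A minimal check in Python 3, whose `str.upper()` and `str.lower()` apply the full Unicode case mappings (the variable names are only for this example):

```python
word = "Stra\u00dfe"            # "Straße"
capital_sharp_s = "\u1e9e"      # LATIN CAPITAL LETTER SHARP S

default_upper = word.upper()    # full case mapping turns ß into SS
print(default_upper)

# The capital sharp s lowercases to ß, but ß does not uppercase to it.
print(capital_sharp_s.lower() == "\u00df")
print("\u00df".upper() == capital_sharp_s)
```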

Best wishes,
   Otto Stolz




Re: L2/18-181

2018-05-17 Thread Otto Stolz via Unicode

Am 2018-05-16 um 22:46 Uhr hat Doug Ewell geschrieben:

2. Collation is different between the Assamese and Bengali languages,
and code point order should reflect collation order.

…

4. The use of a single encoded script to write two languages forces
users to use language identifiers to identify the language.


I wonder how English and French ever could
be made to use a single script, let alone
German (“ß”), Icelandic (“þ”), Swedish (“å”),
Latvian (“ē”), Czech (“č”) or – you name it.

Best wishes,
  Otto Stolz


Re: A sketch with the best-known Swiss tongue twister

2018-03-09 Thread Otto Stolz via Unicode

2018-03-09 12:09 GMT+01:00 Mark Davis ☕️ via Unicode:

De Papscht hät z’Schpiäz s’Schpäkchbschtekch z’schpaat bschtellt.
literally: The Pope has [in Spiez] [the bacon cutlery] [too late]
ordered.


Am 2018-03-09 um 12:52 schrieb Philippe Verdy via Unicode:

Is that just for Switzerland in one of the local dialectal variants ?


Basically the same in Central Swabian (I am from Stuttgart):
  I måen, mir häbet s Spätzles-Bsteck z spät bstellt.
  literally: I guess, we have ordered the noodle cutlery too late.

And when my niece married a guy with the Polish surname Brzeczek
and had asked for cutlery for their wedding present, guess what we
have told them. ☺

Otto

Solution:
  Zerst hemmer denkt, mir häbet für die Brzeczeks s Bsteck
  z spät bstellt, aber nå håts doch no glangt.


Re: Unicode education in UK Schools

2017-07-08 Thread Otto Stolz via Unicode

Hello,

am 2017-07-07 um 20:45 Uhr hat Asmus Freytag geschrieben:
I also checked whether there are accessible homework assignments that 
mention Unicode ("Hausaufgabe Unicode"). I didn't go very deep, but it 
seems that it's not untypical to relegate Unicode to a sidebar, 
explaining the "\u" notation and mentioning that you get ASCII if you 
set the upper byte to 0 (in a UTF-16 string, as supported by Java etc.).


Try also “Übung Unicode”.

Best wishes,
   Otto





Tilde (was: Unicode education in UK Schools)

2017-07-08 Thread Otto Stolz via Unicode

Hello,

am 2017-07-07 um 17:14 Uhr hat William_J_G Overington geschrieben:

I found that the character a tilde as I now know it to be called is only used 
in Portuguese.


Just for the record:

“Ô is used in Portuguese, Kashubian;
“Ñ” is used in Galician, Spanish, Mirandese, Catalan (only for Spanish 
loan words), even English (for Spanish loan words), Breton (in Peurunvan 
spelling), Basque;

“Õ” is used in Estonian, Livonian (extinct since 2013);
“Ȭ” is used in Livonian;
“Ũ” is used in Mirandese.

I have only considered European official, and regional, languages.
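
All of these letters share the same combining mark, which canonical decomposition makes visible. A small sketch using Python's standard `unicodedata` module (the helper name `has_tilde` is an assumption for this example):

```python
import unicodedata

COMBINING_TILDE = "\u0303"
# Ã, Ñ, Õ, Ȭ, Ũ written as escapes so the precomposed forms are unambiguous.
letters = ["\u00c3", "\u00d1", "\u00d5", "\u022c", "\u0168"]


def has_tilde(ch: str) -> bool:
    """True if the canonical decomposition of ch contains U+0303."""
    return COMBINING_TILDE in unicodedata.normalize("NFD", ch)


for ch in letters:
    decomposed = unicodedata.normalize("NFD", ch)
    code_points = " ".join(f"U+{ord(c):04X}" for c in decomposed)
    print(ch, unicodedata.name(ch), "->", code_points)
```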

Cheers,
  Otto



Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-07-04 Thread Otto Stolz via Unicode

Hello,

on 03.07.2017 19:01, Otto Stolz via Unicode wrote:

Since German is the only language using “ß” (if I am not mistaken), […]


Am 2017-07-03 um 20:15 Uhr hat Gerrit Ansmann geschrieben:
Some old Sorbian (blackletter) orthographies also employed the ß. It was 
also used at the beginning of words where it was capitalised to Sſ at 
the beginning of sentences or similar.


I was referring to contemporary writing systems. Indeed, several
east European languages (including, e. g. Latvian) were written
in blackletter, with German sound-letter correspondence, before
they developed their own writing systems.

Thanks for pointing to this particular uppercasing rule.

I have not thought of Yiddish, though. This used to be written
with Hebrew letters (plus some particular ligatures). Usually,
it is transliterated into the Latin script according to the
YIVO rules of 1936. In Germany, there is an alternative transcription
in use, defined by Ronald Lötzsch in 1990. The latter
has the “ß” also in the beginning of words. However, there is
no upper-case equivalent, as Yiddish has no case distinction,
hence all Yiddish letters are transcribed to lower-case Latin,
even in the beginning of a sentence.

I am not aware of all-caps being 
used (which was very rare in blackletter in general).


The only word to be printed in blackletter all-caps was
– as far as I know – “der HERR”, or “der HErr”, meaning
“the Lord” (in texts from the bible). In general, blackletter
capitals are not designed for all-caps, so that would look
disgusting. Hence the form “HErr”, which is a bit more
readable.

Best wishes,
  Otto



Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-07-03 Thread Otto Stolz via Unicode

Hello,

am 2017-06-30 um 17:34 Uhr hat Michael Everson geschrieben:

It would be sensible to case-map ß to ẞ however.


Since German is the only language using “ß” (if I am
not mistaken), Unicode should comply with the official
German orthographic rules with respect to this letter.

As I have reported to this list, § 25 E3 of the official
German spelling rules clearly gives preference to “SS”
as the uppercase equivalent for “ß”. And before the latest
(2017) update of those rules, “ẞ” was not allowed, at all.

Best wishes,
   Otto


Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-07-03 Thread Otto Stolz via Unicode

Hello,

am 2017-07-03 um 18:16 Uhr habe ich geschrieben:

This rule did hold for all consonants, there’s nothing
particular about double-s.


On 2017-07-03 at 18:05 Jörg Knappen had written:

the hyphenation oddity … never affected the letter s.


Jörg is right. I forgot the additional rule that you
had to spell “ß” instead of “ss” at the end of every
constituent of a compound word, so the rule I reported
would never be applied to “ss”. Also the “ss” → “ß”
rule has been dropped by the spelling reform of 1996.

Btw., the dropping of said ß rule has led to much
controversy during the ’90s. Most people were not
aware that that very rule had been introduced by the
penultimate spelling reform, in 1901.

Best wishes,
  Otto




LATIN CAPITAL LETTER SHARP S officially recognized

2017-06-30 Thread Otto Stolz via Unicode

Hello,

The Rat für deutsche Rechtschreibung (Council for German Orthography),
which is responsible for the further development of the official
German orthography, has finally recognized LATIN CAPITAL LETTER SHARP S
as a possible upper-case equivalent for the LATIN SMALL
LETTER SHARP S.

The report announcing the change is dated 2016-12-08, but
the official rules have been updated only yesterday, so
the change is currently in the news (not very prominently,
though).

The pertinent section from the official 2017 rules reads as follows:

§ 25 E3
Bei Schreibung mit Großbuchstaben schreibt man SS.
Daneben ist auch die Verwendung des Großbuchstabens 
ẞ möglich. Beispiel: Straße – STRASSE – STRAẞE.


Which translates to:

When writing all caps, you spell SS.
Alternatively, it is possible to use the upper-case ẞ.
Example: Straße – STRASSE – STRAẞE.


So, SS remains the primary upper-case equivalent of ß.
Yesterday’s note to the press says that the capital ẞ
is meant mainly for passports and similar official
documents, which have to reproduce personal names faithfully
in their respective spelling variants. E. g.,
passports used to distinguish proper names such as
GROẞMANN and GROSSMAN; up to now, they usually have
spelled GROßMANN, with a small ß between the capitals,
which looks ugly in most fonts.
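
For software that has to keep such names distinct, a name-preserving uppercasing step could look like the following minimal Python sketch. The function name `upper_keep_sharp_s` is hypothetical; the approach simply substitutes U+1E9E for ß before applying the default uppercase mapping, which leaves U+1E9E unchanged.

```python
CAPITAL_SHARP_S = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S


def upper_keep_sharp_s(name: str) -> str:
    """Uppercase a name, mapping ß to capital ẞ instead of the default SS."""
    return name.replace("\u00df", CAPITAL_SHARP_S).upper()


# "Großmann" and "Grossmann" stay distinguishable in all-caps.
print(upper_keep_sharp_s("Gro\u00dfmann"))
print(upper_keep_sharp_s("Grossmann"))
# The default mapping collapses both spellings.
print("Gro\u00dfmann".upper())
```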

Best wishes,
   Otto










Re: Encoding of old compatibility characters

2017-04-05 Thread Otto Stolz

Hello,

Am 31.03.2017 um 09:57 schrieb Eli Zaretskii:

Arial Unicode MS supports that character [U+23E8], FWIW.


From: Otto Stolz<otto.st...@uni-konstanz.de>
Date: Tue, 4 Apr 2017 15:21:02 +0200

Not on my good ole Windows XP SP3 system.


On 4/4/2017 7:58 AM, Eli Zaretskii wrote:

This here is also XP SP3.  Maybe some package I have installed updated
the font?


Am 04.04.2017 um 18:51 schrieb Asmus Freytag:

AFAIK, this font is / was installed by MS Office.


I have got MS Word 2002 and MS Excel 2000.
Maybe later versions come with an amended version of Arial Unicode MS.

Cheers,
   Otto





Re: Encoding of old compatibility characters

2017-04-04 Thread Otto Stolz

Am 31.03.2017 um 09:57 schrieb Eli Zaretskii:

Arial Unicode MS supports that character [U+23E8], FWIW.


Not on my good ole Windows XP SP3 system.

Best wishes,
   Otto


Re: Standardized variation sequences for the Deseret alphabet?

2017-03-23 Thread Otto Stolz

Hello Michael, others,

On 2017/03/23 09:03, Michael Everson wrote:

It's the same diphthong (a sound) written with different
letters.


Am 23.03.2017 um 06:54 schrieb Martin J. Dürst:

I think this may well be the *historically* correct analysis. And that
may have some influence on how to encode this, but it shouldn't be
dominant.

What's most important is (past and) *current use*.


Same issue as with German sharp S: The blackletter »ß« derives from an
ſ-z ligature (hence its German name »Eszett«), whilst the Roman type
»ß« derives from an ſ-s ligature. Still, we encode both variants as
identical letters. I’ve got a print from 1739 with legends in both
German (blackletter) and French (Roman italics), comprising both types
of ligatures in one single document.

Best wishes,
  Otto



Re: "Oh that's what you meant!: reducing emoji misunderstanding"

2016-11-18 Thread Otto Stolz

Am 18.11.2016 um 00:31 schrieb Doug Ewell:

Or, people could just say what they mean, using language.


This is not so easy, as Lewis Carroll had already seen,
cf. this snippet from “Alice in Wonderland”:

“Then you should say what you mean,” the March Hare went on.
“I do,” Alice hastily replied; “at least—at least I mean what I say—
that’s the same thing, you know.”
“Not the same thing a bit!” said the Hatter.



Best wishes,
   Otto




Re: Dataset for all ISO639 code sorted by country/territory?

2016-09-17 Thread Otto Stolz

Hello,

am 2016-09-17 um 11:19 Uhr hat Mats Blakstad geschrieben:

Is there any dataset that contains all languages in the world sorted by
country/territory?


Have you already tried <http://www.ethnologue.com/>?

Also, <http://www.sil.org/>
and <http://www-01.sil.org/iso639-3/codes.asp>
may provide partial answers.

Best wishes,
  Otto Stolz



Re: Announcing The Unicode® Standard, Version 9.0

2016-06-23 Thread Otto Stolz

Ciao,

il 2016-06-22 alle 00:02 announceme...@unicode.org ha scritto:

Version 9.0 of the Unicode Standard is now available.

…

MOTOR SCOOTER


Almost exactly 70 years after its invention, “la vespa” has
found her way into Unicode. I immediately relayed that important
news to the members of my Italian language class ;-)

Auguri,
  Otto




Re: Polyglot keyboards

2016-05-10 Thread Otto Stolz

Hello,

I had written:

<https://de.wikipedia.org/wiki/T2_(Tastaturbelegung)>.


On 2016-05-10 16:55 GMT+02:00 Doug Ewell has written:

QWERTY users are about as willing to switch to QWERTZ


I have never meant that QWERTY – or AZERTY – users should
switch to QWERTZ. I just wanted to point to one instance of
an officially standardized polyglot keyboard layout.

E. g., there is already the Canadian multilingual keyboard, cf.
<https://en.wikipedia.org/wiki/File:KB_Canadian_Multilingual_Standard_comment-en.svg>,
based on the traditional QWERTY layout.  I do hope that other
standard bodies will follow suit and define their own QWERTY,
or AZERTY, or whatever versions of polyglot keyboard layouts,
in accordance with ISO/IEC 9995.

Am 2016-05-10 um 17:30 Uhr schrieb Philippe Verdy:

All that can be made reasonable is to extend existing layouts with minimal
changes:

…

This leaves little freedom for changes except for keys currently assigned
to less essential characters such as the degree sign, the micro sign, the
pound sign (in countries not using this symbol daily), the "universal"
currency sign, the paragraph mark... Those can be used to fit better
candidates for extensions.


Another option (which I exploited for my personal keyboard layouts) is
the re-definition of a special-character key to work as a dead key.
E. g., on my personal keyboard, the »"« key gives access to all sorts
of quote characters (for French, German, English, …, even ASCII),
depending on the following key; the »~« key works as tilde accent
on the letter typed subsequently; and so on. This scheme allows the
conventional QWERTZ hardware to be used for multilingual typing –
with minimal re-learning and training. And still the »µ« key produces
the »µ« character :-)
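
The dead-key scheme described above can be modelled as a tiny state machine. The following Python sketch is only an illustration: the tables `DEAD_KEYS` and `QUOTE_TABLE`, the key assignments in them, and the function `type_keys` are all hypothetical and do not reproduce the actual personal layout.

```python
import unicodedata

# Hypothetical table: dead key -> combining mark applied to the next letter.
DEAD_KEYS = {"~": "\u0303"}
# Hypothetical table: key typed after the dead '"' -> quote character produced.
QUOTE_TABLE = {"d": "\u201e", "e": "\u201c", "f": "\u00ab", '"': '"'}


def type_keys(keys: str) -> str:
    """Translate a sequence of keystrokes into the resulting text."""
    out = []
    pending = None  # dead key waiting for its follow-up keystroke
    for key in keys:
        if pending is None:
            if key in DEAD_KEYS or key == '"':
                pending = key
            else:
                out.append(key)
        elif pending == '"':
            # Unknown follow-up: fall back to a plain quote plus the key.
            out.append(QUOTE_TABLE.get(key, '"' + key))
            pending = None
        else:
            if key == " ":
                out.append(pending)  # dead key + space yields the spacing character
            else:
                out.append(unicodedata.normalize("NFC", key + DEAD_KEYS[pending]))
            pending = None
    if pending is not None:
        out.append(pending)
    return "".join(out)


print(type_keys("Espa~na"))
print(type_keys('"dHallo"e'))
```

Keys that are not redefined as dead keys pass through unchanged, which is why the remaining special-character keys keep their printed meaning.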

Best wishes,
  Otto Stolz


Polyglot keyboards (was: Non-standard 8-bit fonts still in use)

2016-05-10 Thread Otto Stolz

Hello,

am 2016-05-08 um 20:11 Uhr schrieb Don Osborn:

Another thing about user needs is that the polyglot/pluriliterate user
may prefer something that reflects that, as opposed to having multiple
keyboards for languages whose character repertoires are much the same.
From a national or regional (sub-continental) point of view I would
think a one-size fits all/many standard or set of keyboard standards
would be ideal. But no one seems to be going there yet, after all these
years.


Yes, there is somebody going there. E. g., the German standard
DIN 2137:2012-06 defines a “T2” layout which is meant
for all official, Latin-based orthographies worldwide, and
additionally for the Latin-based minority languages of Germany
and Austria. The layout is based on the traditional QWERTZU layout
for German and Austrian keyboards (which is now dubbed “T1”).
Cf. <https://de.wikipedia.org/wiki/T2_(Tastaturbelegung)>.

There is also a “T3” layout defined which comprises all characters
mentioned in ISO/IEC 9995-3:2010.

You can even buy a hardware T2 keyboard; however I have not tried it,
because I have defined my own keyboard layout suite (pan-European Latin,
pan-European Cyrillic, monotonic Greek, and Yiddish) for personal use,
long ago.

Best wishes,
  Otto Stolz


Re: Swapcase for Titlecase characters

2016-03-20 Thread Otto Stolz

Hello,

Am 19.03.2016 um 17:40 schrieb Doug Ewell:

As one anecdote (which is even less like "data" than two anecdotes), I
could not find any of the characters IJ ij DŽ Dž dž LJ Lj lj NJ Nj nj or their hex
equivalents in any of the CLDR keyboard definitions. I'd imagine that
users just type the two characters separately, and that consequently
most data in the real world is like that.


For »IJ«,
cf. .

Regards,
  Otto




transliteration of mjagkij znak (Cyrillic soft sign)

2016-02-08 Thread Otto Stolz

Hello,

I am wondering how U+02B9 MODIFIER LETTER PRIME made
its way into the Unicode repertoire, and how it
acquired its comment “transliteration of mjagkij znak
(Cyrillic soft sign: palatalization)”.

ISO/R 9:1954 through ISO/R 9:1986 map the mjagkij znak
“ь” to the apostrophe, and so does DIN 1460:1982. The latter
clearly depicts the apostrophe that later became U+02BC,
while I am not sure whether also ISO/R 9 does so or rather
depicts a glyph like U+0027. (All of these standards
predate Unicode, so they just depict glyphs.)

ISO/R 9:1995 maps the mjagkij znak “ь” to the prime,
particularly to the modifier letter U+02B9, in accordance
with the comment in the Unicode charts.
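
The three candidate characters are easily confused on screen, so it can help to list their names and general categories. A minimal sketch with Python's standard `unicodedata` module (the variable names are chosen for this example only):

```python
import unicodedata

SOFT_SIGN = "\u044c"                        # Cyrillic ь
candidates = ["\u0027", "\u02bc", "\u02b9"]  # apostrophe, modifier apostrophe, prime

names = {ch: unicodedata.name(ch) for ch in candidates}
for ch in candidates:
    print(f"U+{ord(ch):04X}", names[ch], unicodedata.category(ch))

print(f"U+{ord(SOFT_SIGN):04X}", unicodedata.name(SOFT_SIGN))
```

U+0027 is classified as punctuation, whereas U+02BC and U+02B9 are modifier letters, which matters for word segmentation of transliterated text.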

Unicode archeologists, can you shed some light on the
history of both U+02B9 and the mjagkij znak?

And linguists, can you tell me how the mjagkij znak is
transliterated normally, as an apostrophe or as a prime?

Thanks for any comments,
  Otto



Re: Question about Perl5 extended UTF-8 design

2015-11-06 Thread Otto Stolz

Am 05.11.2015 um 23:11 schrieb Ilya Zakharevich:

First of all, “reserved” means that they have no meaning.  Right?


Almost.

“Reserved” means that they have currently no meaning
but may be assigned a meaning, later; hence you ought
not use them lest your programs, or data, be invalidated
by later amendments of the pertinent specification.

In contrast, “invalid”, or “ill-formed” (Unicode term),
means that the particular bit pattern may never be used
in a sequence that purports to represent Unicode characters.
In practice, that means that no program is allowed to
send those ill-formed patterns in Unicode-based data exchange,
and every program should refuse to accept those ill-formed
patterns, in Unicode-based data exchange.

What a program does internally is at the discretion (or should
I say: “whim”?) of its author, of course – as long as the
overall effect of the program complies with the standard.
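
The distinction can be observed with any conformant decoder. The following sketch uses Python's built-in strict UTF-8 codec; the helper name `is_well_formed_utf8` is an assumption for this example, and U+0378 is used as a sample code point on the assumption that it is still unassigned.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Ill-formed: an overlong encoding and an encoded surrogate are rejected.
print(is_well_formed_utf8(b"\xc0\xaf"))
print(is_well_formed_utf8(b"\xed\xa0\x80"))

# Reserved (unassigned) code point: no meaning yet, but perfectly well-formed.
reserved = "\u0378".encode("utf-8")
print(reserved, is_well_formed_utf8(reserved))
```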

Best wishes,
  Otto Stolz







Re: [somewhat off topic] straw poll

2015-09-12 Thread Otto Stolz

Am 10. September 2015 um 20:04 h schrieb Peter Constable:

[…] creating a Web page containing (say) some Latin characters
- not obscure,  […]  to use (say) Notepad and entering HTML
numeric character references; and that my findings were that
it worked.



Q1: Would you find that to be an interesting post […]


A1: No, because the scenario given is about a standard technique
that every list participant is supposed to be aware of.

I’d simply ignore a message of this type. If, however, a message
were asking a question on this technique, I’d probably send the
author a short reply pointing to the pertinent FAQ entry, or
HTML tutorial.
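
For reference, the technique in question fits in a few lines. This sketch uses only the Python standard library; the sample text is arbitrary.

```python
import html

text = "\u017d, \u0151, \u1e9e"   # Ž, ő, ẞ

# Replace every non-ASCII character with a decimal numeric character reference.
ncr = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ncr)

# A browser (here: html.unescape) turns the references back into characters.
print(html.unescape(ncr) == text)
```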


Q2: If I were to send messages along that line on a regular basis,
would that add value to your participation in the list, or reduce it?


A2: Neither.

If a particular author became notorious for this sort of contribution,
I’d start to ignore his messages altogether. If his messages
developed into a nuisance, I’d add him to the filter rules of my e-mail
client.


Q3: If 50 people (still a small portion of the list membership)
were to send messages along that line on a regular basis, would
that add value to your participation in the list, or reduce it?


A3: Not all of them would start doing so at the same time, would
they? Hence, A2 would apply, on a per-case basis, without much ado.

Best wishes,
  Otto






Re: a suggestion new emoji .

2015-08-19 Thread Otto Stolz

Hello Emma Haneys,

Am 19.08.2015 um 01:20 schrieb Emma Haneys:

i just wondering if i can suggest a new emoji .
hoppefully you can respone to me .


So far, you have only received derisive responses from
the Unicode discussion list. This is because you have
not understood how suggestions for Unicode characters work.
Please read ‹http://www.unicode.org/faq/›, in particular
‹http://www.unicode.org/faq/char_proposal.html›.


i suggest one and only for fruit
category . it is a durian .


You cannot suggest a new character just because it would
be “nice to have”. Rather, you have to supply evidence that
an additional character really needs to be encoded, e. g.
because it is already widely used in print and cannot be
represented in Unicode.

Best wishes,
  Otto Stolz



Re: Unicode organization is still anti-Serbian and anti-Macedonian

2014-02-17 Thread Otto Stolz

Hello,

Крушевљанин Иван had written:

People, do you realize that proper glyphs are needed everywhere and
every time, CONSTANTLY, even when American ordinary user chats with
German ordinary user about Serbian language


Am 2014-02-17 um 00:50 Uhr MEZ schrieb Richard Wordingham:

One issue here that I don't know the solution for is how the right
glyphs should be chosen for displaying plain text communication.  I
don't know any general mechanism for, say, specifying that by
default Cyrillic text should use Serbian glyphs, CJK characters
should use Japanese glyphs and that Cuneiform should use Neo-Assyrian
glyphs.


This boils down to the fact that, in plain-text communication, the
receiver can – and should – choose the appropriate font. This holds,
in particular, for classical e-mail. Hence my recent claim that the
problem posed by Иван is a mere font issue.

In HTML, this is a bit different: The author has control over the
fonts (hence over the glyphic style) used for the display, but the
reader can normally override the author’s choice. Hence, WWW authors
should specify suitable fonts for their respective articles (or even
parts thereof).

On paper, or in PDF and other facsimile formats, the author is
entirely responsible for the glyphic style and appearance, and he
should always choose suitable fonts. This is the realm of the
solution involving that ‘Gentium Plus srp’ font I had mentioned,
recently.

May I humbly remind Иван (and all other readers of this thread) that
the problem manifests itself (mainly or only) with italic style
letters; hence there remains virtually no problem with normal
(non-italic) style.

Best wishes,
  Otto Stolz


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Unicode organization is still anti-Serbian and anti-Macedonian

2014-02-15 Thread Otto Stolz

Hello,

Am 14.02.2014 11:37, schrieb Крушевљанин:

There is still problem with letters бгдпт in italic, and б in regular mode.


As has been said, already, in this thread, this is a mere font issue:
you have to use a particular font in order to display those italic
letters, in the Serbian and Macedonian style.

One example:
The ‘Gentium Plus’ font from SIL International, available from
http://scripts.sil.org/cms/scripts/page.php?item_id=Gentium
can be configured to display the Serbian/Macedonian style italics
rather than the glyphs used elsewhere. If this configuration is
too cumbersome for you, feel free to ask me privately, for a copy
of the font, configured for Serbian/Macedonian. I can send you
that copy, without any obligation to maintain it or to adapt forth-
coming versions.

Best wishes,
  Otto Stolz





Re: Dotted Circle plus Combining Mark as Text

2013-10-22 Thread Otto Stolz

Hello,

am 2013-10-20 um 21:32 Uhr schrieb Asmus Freytag:

A typical convention is the “show special characters” feature in many editors.


Or in special fonts, such as CombiChar
http://eta.ktl.mii.lt/~mask/varia/KoDi-07/Inf_mokslai/Tumasonis/CombiChar.ttf,
probably by Vladas Tumasonis.


Incidentally, the dotted circle shown in the Unicode Code charts is
*not* 25CC, and if I were to implement a show dotted circle feature in
a program I would not use 25CC for this - that character has a standard
glyph of rather unsuitable metrics for the purpose, never mind that many
people have co-opted it.


I have written a tutorial (in German) that discusses character entry,
and typographical conventions, for virtually all European languages;
therein, I have to display many diacritical marks over dotted circles.
I reckon this problem is not specific to my work.

I have tried two approaches, viz. applying the respective diacritics
to U+25CC, and formatting diacritical marks with the CombiChar font,
respectively. I am not happy with either, but currently apply the
latter throughout.

As Asmus already has said, U+25CC does not look quite right.
On the other hand, the CombiChar font – in order to fulfil its purpose –
necessarily violates D56 of TUS.

So what should a poor author do according to TUS, when he wishes to
discuss diacritical marks and their associated keystrokes?
D52 says about the combining characters:

These characters are not normally used in isolation unless
they are being described.

But how can they be described within the framework of Unicode?

Bemused,
  Otto Stolz





Re: Why blackletter letters?

2013-09-11 Thread Otto Stolz

Hello,

am 2013-09-10 um 22:43 Uhr hat Gerrit Ansmann geschrieben:

In contrast to Greek and Coptic (as far as I
understand them), changing a modern text to fraktur is only a change of
the font


This is not so.

Fraktur text is subject to orthographic rules different
from those applying to text in modern Latin.

E. g., in German fraktur text, there are specific rules
for differentiating Long S »ſ« from Round S »s«, while
in modern Latin text only the Round S has been used for
decades (the latest Long S in modern Latin German printed
text I have seen is from the 1950s, when it was already
rather unusual; the official German spelling rules from
1996 do not mention the Long S any more). Hence, a modern
German text, when simply transliterated into fraktur, will
not be orthographically correct.

The various abbreviations used in older fraktur text,
but not in modern Latin script, have already been mentioned
by other contributions to this thread.

Best wishes,
  Otto Stolz




MSKLC restrictions (was: Ways to show Unicode contents on Windows?)

2013-07-31 Thread Otto Stolz

Hello,

am 2013-07-30 um 02:11 Uhr hat Ilya Zakharevich geschrieben:

   [I’m switching to “pedantic” mode since there are so many posts on
this list which FALSELY accuse Windows' keyboard system of
shortcomings.  There ARE many shortcomings, but they are usually
buried under an avalanche of misinformation.]

…

  Windows itself supports 2¹⁶ - 1 dead keys.  Due to bugs in MSKLC
  (in kbdutool), one is restricted to having 2¹² - 1 dead keys in a
  .klc.

…

   b) A significant limitation of Windows keyboard system is that one
  cannot enter an OUT-of-BMP character via a deadkey.  (And this is
  probably what you meant above.)


Another limitation: Apparently you cannot define sequences
of dead keys. If I am mistaken, I’d appreciate any hint
on how to define dead key sequences, in the MSKLC framework.

E. g., for a Greek-polytonic keyboard the natural approach
would be to have just seven dead keys, three for the three
different accents, two for the two different spiritus, one
for the subscripted Iota, and one for the diaeresis. Then,
a character with multiple diacritics, such as 1F96 “ᾖ”
(GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND
YPOGEGRAMMENI) would be entered with the corresponding
multiple dead keys (three, in this example).

Compare this to the current Greek-Polytonic Windows keyboard
layout, which is virtually incomprehensible, because you have
to memorize one particular dead key for every single combination
of diacritics (more than 30 different dead keys).
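
The seven-key approach maps naturally onto Unicode canonical composition. The sketch below is not an MSKLC definition; it merely shows, in Python, that seven combining marks plus NFC normalization suffice to reach the precomposed characters. The table `DEAD_KEYS`, its key names, and the function `compose` are hypothetical.

```python
import unicodedata

# Hypothetical names for the seven dead keys and the combining marks they add.
DEAD_KEYS = {
    "oxia": "\u0301",           # acute
    "varia": "\u0300",          # grave
    "perispomeni": "\u0342",    # circumflex
    "psili": "\u0313",          # smooth breathing
    "dasia": "\u0314",          # rough breathing
    "ypogegrammeni": "\u0345",  # iota subscript
    "dialytika": "\u0308",      # diaeresis
}


def compose(base: str, *marks: str) -> str:
    """Apply the named marks to a base letter and compose with NFC."""
    return unicodedata.normalize("NFC", base + "".join(DEAD_KEYS[m] for m in marks))


# Breathing first, then accent, then iota subscript.
eta = compose("\u03b7", "psili", "perispomeni", "ypogegrammeni")
print(eta, f"U+{ord(eta):04X}", unicodedata.name(eta))
```

In this sketch the breathing has to be given before the accent, because both marks have the same combining class and normalization does not reorder them.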

I guess sequences of dead keys would also come in handy
for a Vietnamese keyboard layout.

Best wishes,
 Otto Stolz






Capitalization in German (was: s-j combination in Unicode?)

2013-02-19 Thread Otto Stolz

Hello,

am 17.02.2013 06:55, schrieb Stephan Stiller:

As far as real ambiguities are
introduced, the loss of capitalization on the first letter introduces
far more, impressionistically speaking, and they might be legally
subtle;


Here is a minimal pair to illustrate that point:
Er hat in Moskau liebe Genossen.
Er hat in Moskau Liebe genossen.
which translates to:
In Moscow, he’s got dear comrades.
In Moscow, he has enjoyed love.
Btw., in spoken language, the individual words are
pronounced identically, whilst the prosody of the
whole sentence is different (and, clearly, significant).

Best wishes,
  Otto Stolz






German »ß« (was: s-j combination in Unicode?)

2013-02-16 Thread Otto Stolz

Hello,

Am 16.02.2013 11:48, schrieb Stephan Stiller:

Or a non-name example: Buße (repentance)
vs Busse (buses). But then, non-name examples are far less likely to
remain ambiguous in context.


Years ago, I saw with my own eyes, in a Swiss magazine
(where they consistently replace “ß” with “ss”), the following
amusing example:
  … Brigitte Bardot mit ihren beachtlichen Körpermassen …
which translates to: “BB, and her considerable bodily masses”,
whilst the author probably wanted to say: “BB, and her
remarkable physical measurements (=body shape)”.

During the discussion on the German spelling reform, in the 1990s,
the same minimal pair has been used in the following context:
  Es ist ein Unterschied, ob ich Bier in Maßen trinke oder in Massen.
meaning: “It makes a difference, whether I drink beer in moderation,
or in masses”.

Minimal pairs for “ß” vs. “ss”, not involving proper names,
are extremely rare; in fact, I only know the two mentioned
in this very note. Between ordinary words and proper names
(or place names), you can, of course, find more minimal pairs,
e. g., “Füßen” (a declension form of “Fuß” = foot) and “Füssen”
(a town in Bavaria).
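
The collisions caused by the Swiss convention are easy to reproduce. A minimal Python sketch, using only word pairs mentioned in this thread (the function name `swiss_spelling` is an assumption for this example):

```python
def swiss_spelling(text: str) -> str:
    """Replace every ß with ss, as is customary in Swiss publications."""
    return text.replace("\u00df", "ss")


# Word pairs taken from this thread.
pairs = [
    ("Ma\u00dfen", "Massen"),
    ("Bu\u00dfe", "Busse"),
    ("F\u00fc\u00dfen", "F\u00fcssen"),
]

collisions = [(a, b) for a, b in pairs if swiss_spelling(a) == b]
for a, b in collisions:
    print(f"{a} and {b} become indistinguishable")
```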

Cheers,
  Otto







Re: Long-term archiving of electronic text documents

2013-01-28 Thread Otto Stolz

Hello,

am 28.01.2013 schrieb William_J_G Overington:

The idea is that there would be an additional UTF format, perhaps UTF-64,
so that each character would be expressed in UTF-64 notation using 64 bits,
thus providing error checking and correction facilities at a character level.


We already have UTF-32, where every 21-bit character takes 32 bits.
So there is plenty of unused space that could be used for error checking
on the character level, if that were ever desired.

Of course, a format that carries additional information in the
otherwise ‘unused’ bits does not comply with the UTF-32 specs;
so, if that idea ever materialised, it would have to sail
under new colours.
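
To make the arithmetic concrete: a code point needs 21 bits, so a 32-bit unit has 11 bits to spare. The following Python sketch stores a single even-parity bit in bit 21. It is a purely hypothetical format with invented function names `pack` and `unpack`, it is not UTF-32, and one parity bit can only detect (not correct) single-bit errors.

```python
CODE_POINT_MASK = 0x1FFFFF  # low 21 bits carry the code point
PARITY_SHIFT = 21           # first of the 11 spare bits


def parity(value: int) -> int:
    """Return 1 if value has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


def pack(cp: int) -> int:
    """Combine a code point with its parity bit into one 32-bit unit."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return cp | (parity(cp) << PARITY_SHIFT)


def unpack(unit: int) -> int:
    """Extract the code point, raising ValueError on a parity mismatch."""
    cp = unit & CODE_POINT_MASK
    if parity(cp) != (unit >> PARITY_SHIFT) & 1:
        raise ValueError("parity mismatch")
    return cp


unit = pack(0x43)
print(hex(unit), unit > 0x10FFFF)  # the packed unit is not a valid UTF-32 value
print(hex(unpack(unit)))
```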

Best wishes,
  Otto Stolz



Re: Is that character *+A7AC LATIN CAPITAL LETTER SCRIPT G ?

2013-01-10 Thread Otto Stolz

Hello,

le 09/01/2013 18:07, Frédéric Grosshans a écrit :

Yes, but I actually don't know. I'd really like to have some idea on those old
printing techniques, but I fear we're drifting to off topic subjects...


Am 2013-01-09 um 18:16 schrieb Frédéric Grosshans:

Actually, the preceding tool combined with
http://en.wikipedia.org/wiki/Mimeograph would be my best (uninformed) guess.


I’d rather guess, he used this technique:
  http://en.wikipedia.org/wiki/Dry_transfer.
I used it myself, in the 70s, to insert all those
Greek symbols into the formulae in my Dipl.-Phys. thesis.
It renders much clearer glyphs than the mimeograph
technique.

Best wishes,
  Otto Stolz




Re: UCA and Russian letter Ё

2012-12-23 Thread Otto Stolz

Hello,

Leo Broukhis had written:

In Russian, the difference between Е and Ё is primary at the beginning
of a word as they are considered distinct letters of the alphabet, yet
secondary in the middle of a word, as the dieresis over Ё is not
mandatory. As an example, ель < ёлка, but тёлка < тель, see
http://ru.wikisource.org/wiki/Орфографический_словарь_русского_языка


On 2012-12-21 at 20:05, Leif Halvard Silli wrote:

My Moscow Russian-Norwegian from 1987 and my Pocket Oxford Russian
Dictionary from 2003 agree that both list words on Ё and Е under the
same category – namely, under the letter Е.


So do both “Русско-Немецкий Словарь” (Moscow, 1955) and “Langenscheidts
Taschenwörterbuch”, 4th ed. (Berlin, 1963).

Hence, I deem Leo’s example a red herring.

Best wishes for a merry Xmas (or whatever) and a happy New Year,
  Otto








Re: UTF-8 ill-formed question

2012-12-16 Thread Otto Stolz

Hello,

on 2012-12-15, Philippe Verdy wrote:

But there's still a bug (or request for enhancement) for your Pocket
converters :

- For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates)
from the sets of convertible codepoints.

- But you don't exclude this range in the case of your UTF-8 and UTF-32
magic encoders which could forget this case. Of course your encoder would
create distinct sequences for these code points, but they are not valid
UTF-8 or valid UTF-32 encodings.


Only the UTF-16 variant is really *my* “magic pocket encoder” (MPE);
the author is named on each of the three.

I would not demand more from those MPEs than converting
a valid UCS character to a valid, and equivalent, UTF
sequence – and illustrating the underlying algorithm.
I guess, originally, they were meant as jokes – partially,
at least; I have used them as a didactic device, in my
beginner's lecture on Unicode.

Clearly, Mike Ayers made the point that the UTF-32 encoding
is nothing but a simple shortcut (in the terms of its two
predecessors). His one-row-only MPE expresses this quite
aptly, and any additional branch would spoil the impression.

The reason I excluded the surrogates from my UTF-8 MPE
was really that I needed additional space for the user’s
guide on the reverse side.
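
For readers who want to see the surrogate exclusion in practice, here is a
small sketch. The helper names `is_scalar_value` and `codecs_rejecting` are
made up for this illustration; the only library behaviour relied upon is that
Python's built-in UTF-8, UTF-16 and UTF-32 encoders refuse lone surrogates.

```python
def is_scalar_value(cp: int) -> bool:
    """True for code points a UTF may encode: U+0000..U+10FFFF minus surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def codecs_rejecting(ch: str) -> list[str]:
    """Return the codecs (out of three UTFs) that refuse to encode `ch`."""
    rejected = []
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            rejected.append(codec)
    return rejected
```

Calling `codecs_rejecting("\ud800")` lists all three codecs, while a
neighbouring non-surrogate such as U+D7FF is accepted by each of them.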

Cheers,
  Otto Stolz








Re: UTF-8 ill-formed question

2012-12-16 Thread Otto Stolz

Hello,

2012/12/16 Otto Stolz otto.st...@uni-konstanz.de

The reason I excluded the surrogates from my UTF-8 MPE
was really that I needed additional space for the user’s
guide on the reverse side.


Sorry, typo; I meant: “my UTF-16 MPE”. I added that
extra row (with the branch excluding the surrogates)
to gain extra space on the reverse side.

On 2012-12-16, Philippe Verdy wrote:

Add this missing row, Everything in the reverse side can remain the same
(or can be using a less cryptic compact description of how it works).


I will certainly not change Marco Cimarosti’s original design
of his UTF-8 MPE.

Best wishes,
  Otto Stolz





Re: UTF-8 ill-formed question

2012-12-12 Thread Otto Stolz

Hello,

on 2012-12-11 at 20:16, James Lin wrote:

If i have a code point: U+4E8C or 二
In UTF-8, it's E4 BA 8C while in UTF-16, it's 4E8C.
Where is this BA comes from?


Cf. http://skew.org/cumped/.
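
As a quick illustration of where the BA comes from, the following sketch
spells out the three-byte UTF-8 form with plain bit operations. The function
name `utf8_encode_3byte` is made up for this example, and it handles only the
range U+0800..U+FFFF (without surrogates), which is the range U+4E8C falls in.

```python
def utf8_encode_3byte(cp: int) -> bytes:
    """Encode a code point from U+0800..U+FFFF as three UTF-8 octets."""
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("outside the three-byte range")
    return bytes([
        0xE0 | (cp >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),          # 10xxxxxx: low 6 bits
    ])
```

For U+4E8C (binary 0100 1110 1000 1100) the middle six bits are 111010;
prefixed with the continuation marker 10 they give 1011 1010, i. e. BA.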

Enclosed are the (almost original) version of “Cima’s Magic
UTF-8 Pocket Encoder” (2004), and its two successors for
further UTFs. Display or print with a fixed-pitch font,
such as Lucida Console or Courier New. Enjoy!

Cheers,
   Otto Stolz


Side 1 (print and cut out):

+------------+-------+-----------------------+------+
|     U+0000 | yy zz |    Cima's UTF-8 Magic | Hex= |
|     U+007F | !  !  |        Pocket Encoder | B-4  |
|         YZ | .  .  |                       |      |
+------------+-------+-------+     Vers. 1.1 | 0=00 |
|     U+0080 | 3x xy | 2y zz |  30 June 2004 | 1=01 |
|     U+07FF | 3. .. | 2. !  |               | 2=02 |
|        XYZ | .  .  | .  .  |          M.C. | 3=03 |
+------------+-------+-------+-------+       | 4=10 |
|     U+0800 | 32 ww | 2x xy | 2y zz |       | 5=11 |
|     U+FFFF | !  !  | 2. .. | 2. !  |       | 6=12 |
|       WXYZ | E  .  | .  .  | .  .  |       | 7=13 |
+------------+-------+-------+-------+-------+ 8=20 |
| U-00010000 | 33 0v | 2v ww | 2x xy | 2y zz | 9=21 |
| U-000FFFFF | !  0. | 2. !  | 2. .. | 2. !  | A=22 |
|      VWXYZ | F  .  | .  .  | .  .  | .  .  | B=23 |
+------------+-------+-------+-------+-------+ C=30 |
| U-00100000 | 33 10 | 20 ww | 2x xy | 2y zz | D=31 |
| U-0010FFFF | !  1. | 2. !  | 2. .. | 2. !  | E=32 |
|       WXYZ | F  4  | 8  .  | .  .  | .  .  | F=33 |
+------------+-------+-------+-------+-------+------+

Side 2 (print, cut out, and glue on back of side 1):

+---------------------------------------------------+
| Cima's UTF-8 Magic Pocket Encoder - User's Manual |
| (vers. 1.1, 30 June 2004, by Marco Cimarosti)     |
|                                                   |
| - Left column: min and max Unicode scalar values: |
|   pick the row that applies to the code point you |
|   want to convert to UTF-8. Letters V..Z mark the |
|   hexadecimal digits that have to be processed.   |
| - Right column: hexadecimal to base-4 table.      |
| - Central columns: work area to compute each octet|
|   (1 to 4) that constitute UTF-8 octet sequences. |
|     Convert each digit marked by V..Z from hex. to|
| b.-4. Write b.-4 digits on the dots placed under  |
| letters v..z (two b.-4 digits per hex. digit).    |
| Convert 2-digit base-4 number to hex. digits and  |
| write them on the dots on the line. That is your  |
| UTF-8 sequence in hex.  ! Exclamation marks show  |
| passages that may be skipped, either because the  |
| digit is hard-coded, or because it may be copied  |
| directly from the scalar value.                   |
+---------------------------------------------------+
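
The pocket encoder's procedure can also be replayed in a few lines of code.
The sketch below is one reading of the card, not Marco Cimarosti's own code:
the table `ROWS` and the names `to_base4` and `utf8_via_base4` are invented
for this illustration. It writes the code point in base 4, puts the fixed
base-4 prefixes of the card ("3", "32", "33" for lead octets, "2" for trailing
octets) in front of the digits, and converts each group of four base-4 digits
back to two hex digits.

```python
def to_base4(n: int, width: int) -> str:
    """Return n as a base-4 digit string, zero-padded to `width` digits."""
    digits = ""
    for _ in range(width):
        digits = str(n % 4) + digits
        n //= 4
    return digits


# One entry per row of the card:
# (largest code point, number of base-4 payload digits, lead-octet prefix)
ROWS = [
    (0x7F, 4, ""),
    (0x7FF, 6, "3"),
    (0xFFFF, 8, "32"),
    (0x10FFFF, 11, "33"),
]


def utf8_via_base4(cp: int) -> str:
    """Return the UTF-8 octets of `cp` as hex, computed via base-4 digits."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    for upper, width, prefix in ROWS:
        if cp <= upper:
            break
    q = to_base4(cp, width)
    lead_len = 4 - len(prefix)          # payload digits in the lead octet
    groups = [prefix + q[:lead_len]]
    for i in range(lead_len, width, 3):
        groups.append("2" + q[i:i + 3])  # trailing octets: 2 + three digits
    return " ".join(format(int(g, 4), "02X") for g in groups)
```

For U+4E8C the base-4 digits are 10 32 20 30, the groups become 3210, 2322
and 2030, and these read back in hex as E4 BA 8C.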

Enjoy!

Marco
Obverse: Print with a fixed-width font, such as Lucida Console,
and cut out.

╔════════════╦═════════════╦═════════════════════════════════╗
║     U+0000 ║ W  X  Y  Z  ║ Otto’s Magic Pocket Encoder     ║
║     U+D7FF ║ !  !  !  !  ║ for UTF-16  ╔═══════════════════╣
║       WXYZ ║ _  _  _  _  ║             ║    V vv │    V vv ║
╟────────────╫─────────────╢ Version 1.1 ║    U uu │    U uu ║
║     U+E000 ║ W  X  Y  Z  ║ ©2004-07-05 ║ tt T    │ tt T    ║
║     U+FFFF ║ !  !  !  !  ║             ║    _ __ │    _ __ ║
║       WXYZ ║ _  _  _  _  ║             ║ ────────┼──────── ║
╟────────────╫─────────────╚═════════════╣    0=00 │ 13 8=20 ║
║ U-00010000 ║ 31 2t tu uv │ 31 3v Y  Z  ║ 00 1=01 │ 20 9=21 ║
║ U-000FFFFF ║ !  2_ __ __ │ !  3_ !  !  ║ 01 2=02 │ 21 A=22 ║
║      TUVYZ ║ D  _  _  _  │ D  _  _  _  ║ 02 3=03 │ 22 B=23 ║
╟────────────╫─────────────┼─────────────╢ 03 4=10 │ 23 C=30 ║
║ U-00100000 ║ 31 23 3u uv │ 31 3v Y  Z  ║ 10 5=11 │ 30 D=31 ║
║ U-0010FFFF ║ !  !  3_ __ │ !  3_ !  !  ║ 11 6=12 │ 31 E=32 ║
║       UVYZ ║ D  B  _  _  │ D  _  _  _  ║ 12 7=13 │ 32 F=33 ║
╚════════════╩═════════════╧═════════════╩═══════════════════╝


....:....1....:....2....:....3....:....4....:....5....:....6..
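
What the UTF-16 card computes can be written compactly with bit operations.
This is a sketch of the standard surrogate-pair arithmetic, not a transcript
of the card; the function name `utf16be_units` is made up here.

```python
def utf16be_units(cp: int) -> list[int]:
    """Return the UTF-16 code unit(s) for one Unicode scalar value."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:
        return [cp]                      # BMP: the code point itself
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 | (offset >> 10)       # 110110 + top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # 110111 + low 10 bits
    return [high, low]
```

The subtraction of 0x10000 is why the card's right-hand table pairs each hex
digit T with the base-4 value of T minus 1 (e. g. T = 1 gives tt = 00).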


Reverse: Cut out and paste on back of obverse.

╔════════════════════════════════════════════════════════════╗
║ Otto’s Magic Pocket Encoder for UTF-16         Version 1.1 ║
║ User’s Manual             (inspired from Cima’s UTF-8 MPE) ║
╠════════════════════════════════════════════════════════════╣
║• Left column: min and max Unicode scalar values: pick the  ║
║  row that applies to the code point to be converted.       ║
║  T…Z mark the hexadecadic digits that have to be processed.║
║• Central column: work area to compute UTF-16BE code units. ║
║• Right column: hexadecadic to quaternary conversion tables:║
║    for T to tt; = for U/V to uu/vv (step 1) and for step 2.║
║1. Convert each digit marked by T…V from hex to quat. Write ║
║   quat digits on the underscores placed under letters t…v. ║
║2. Convert 2-digit quat numbers to hex digits or copy digits║
║   W…Z, as indicated, and write them on the underscores on  ║
║   the next line. That’s your UTF-16BE sequence in hex

Re: Searching data: map countries to scripts

2012-08-22 Thread Otto Stolz

Hello Manuel,

on 2012-08-20 at 01:05, Manuel Strehl wrote:

I'm looking for a data source, that maps countries to scripts used in
them. The target application is a visualization in the context of my
codepoints.net site, namely http://codepoints.net/scripts.

At the moment I've extracted the prefered scripts from CLDR (e.g., Cyrl
for Russia, Latn for Germany and so on).


One more source you could peruse is the Ethnologue
http://www.ethnologue.com/web.asp.

It contains a mapping from countries to languages:
http://www.ethnologue.com/country_index.asp.
Many languages are tagged with population and usage data;
in the “more information” section of any language,
usually the script is noted.

In many cases, the Ethnologue lists, as separate
languages, what are normally considered dialects.
For your project, this is not a major problem,
as the dialects normally use the same script as the
respective parent languages. Furthermore, the usage and
population data given for most languages will guide
you in assessing the relative significance
of the various languages, and associated scripts.

Good luck,
  Otto



Apostrophe, and DIN keyboard (was: U+25CA LOZENGE)

2012-08-13 Thread Otto Stolz

Hello,

on 2012-08-13 at 18:09, Andreas Prilop wrote:

  http://www.machsmit.de/media/mainteaser/header-ichwillserleben.png
  http://www.machsmit.de/kampagne/printmedien.php
show what the braindead German DIN keyboard layout has done to
the apostrophe (’): Killed by the acute accent (´).


DIN 2112 (from 1928) for mechanical typewriters had indeed no
apostrophe key, due to lack of keys (remember: there are 4 more
letters in the German alphabet than in the US-English one).
However, this standard has been withdrawn, in 2002.

DIN 2137 (from 1976) is for computers:
These keyboards always had both the acute, and grave, accents,
and the (ASCII) apostrophe.

Andreas’ example does not present any evidence that
an acute accent is involved. It could as well be a
real U+2019 apostrophe, rendered in a slanted, sans-serif
font. As the text is presented in PNG, i. e. graphic,
format, you really cannot tell the difference.

Best wishes,
  Otto Stolz





German »Raute« (was: U+25CA LOZENGE)

2012-08-13 Thread Otto Stolz

Hello,

on 2012-08-13 at 20:48, Leif Halvard Silli wrote:

The word 'Raute' reminds of the Norwegian 'rute' - and my Norwegian
book on etymology assumes that 'rute' is derived from 'Raute'. The
Norwegian 'rute' may refer to a cell in a (data) table or in a square
board for chess. Such a 'rute' is of course a square. Perhaps German
'Raute' has a similar possibility of being interpreted as square?

Btw, the Norwegian for 'diamond', in the playing card sense, is
'ruter'. The 'ruter' in the playing card sense, is easily associated
with 'rute' - in other words: square. However, we see that it is not a
square, in the normal sense. The modern German name for diamond
cards, Karo, goes back to Latin quadrum (“quadrangle, square”).
http://de.wikipedia.org/wiki/Karo_(Farbe)



In German, »Raute« is a synonym of »Rhombus«, i. e.
an equilateral quadrilateral. Hence, every »Quadrat«
(square) is a »Raute«, but not vice versa.
(A square also has four equal angles.)

Rhombuses are often depicted resting on a vertex,
whilst squares are usually depicted resting on an edge.
But the orientation of a geometrical shape really does
not change its geometric features, nor its name.

Best wishes,
  Otto Stolz



Re: Sinhala naming conventions

2012-07-17 Thread Otto Stolz

Hello Naena Guru,

on 2012-07-15 at 20:36, Naena Guru wrote:

my challenge stands [...] to show how romanized Singhala violates
any standard in what specific way.


Your “Romanized Singhala” neither complies with, nor
violates, any standard. As it is your own invention, it
is simply out of scope of all existing standards.

However, tagging your “Romanized Singhala” as ISO 8859-1
encoded text certainly violates several standards!
E. g., ISO 8859-1 defines code 41 (hex) to be a
Latin character (viz. “A”); so this code can never
represent a Singhala letter, when it is part of
an ISO 8859-1 encoded text. Likewise, your scheme
violates all pertinent standards of text tagging,
such as RFC 2046: All of those describe how the
encoding of a particular data stream is specified,
so the receiving side will interpret it correctly.
Clearly, if you send a Latin character (and tag it
as such), you cannot expect the receiving side to
interpret it as a Singhalese character.
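
The point about tagging can be made concrete with a short sketch. It uses
only Python's standard codecs; the helper name `decode_as_tagged` and the
sample letter U+0D85 are chosen for illustration.

```python
import unicodedata


def decode_as_tagged(data: bytes, charset: str) -> str:
    """Interpret the octets exactly as the declared charset prescribes."""
    return data.decode(charset)


# Octet 0x41 tagged as ISO 8859-1 is, by definition, the Latin letter A.
latin_a = decode_as_tagged(b"\x41", "iso-8859-1")

# A Sinhala letter sent as properly tagged UTF-8 survives the round trip ...
sinhala = "\u0d85"
round_trip = decode_as_tagged(sinhala.encode("utf-8"), "utf-8")

# ... whereas the same octets tagged as ISO 8859-1 turn into three
# unrelated Latin-1 characters.
mistagged = decode_as_tagged(sinhala.encode("utf-8"), "iso-8859-1")
```

The receiver has nothing but the tag to go by, so `latin_a` is "A" whatever
the sender intended the octet to mean.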

So, the only way for your encoding scheme to be
exploited without violating the pertinent standards
would be to register it under a new name, and then
tag your data accordingly. However, as there are
apparently no technical problems with the existing
solution (i. e. properly tagged Unicode data), your
new encoding scheme will probably not gain wide
acceptance.

You can serve your language much better when you
try to improve current solutions within the realm
of existing standards.  E. g., you could point out
errors (or shortcomings) in existing fonts, editors,
keyboard drivers, and other software and suggest
(or even provide) better solutions. Or you could
publish tutorials or examples of good practice.
In any case, it would be wise to know the existing
standards and comply with them.

Best wishes,
  Otto Stolz






Re: ASSAMESE AND BENGALI CONTROVERSY IN UNICODE STANDARD ::::: SOLUTIONS

2012-07-10 Thread Otto Stolz

Hello Satyakam Phukan,

on 2012-07-10 at 18:33, Satyakam Phukan wrote:

The Bengalis have dropped the letter ৱ. Their “ক্ষ“ is a different
issue on which Unicode have been  told umpteen times by many, for
Assamese it is a letter and for Bengalis it is a combination or
conjunct.


See http://www.unicode.org/faq/char_combmark.html#11,
and related FAQs.

Before asking for “improvements”, you should be familiar
with the pertinent FAQ collection, at the very least.

Sincerely,
  Otto Stolz





Charset declaration in HTML (was: Romanized Singhala - Think about it again)

2012-07-04 Thread Otto Stolz

Hello Naena Guru,

on 2012-07-04, you wrote:

The purpose of
declaring the character set as iso-8859-1 than utf-8 is to avoid doubling
and trebling the size of the page by utf-8. I think, if you have characters
outside iso-8859-1 and declare the page as such, you get
Character-not-found for those locations. (I may be wrong).


You are wrong, indeed.

If you declare your page as ISO-8859-1, every octet
(aka byte) in your page will be understood as a Latin-1
character; hence you cannot have any other character
in your page. So, your notion of “characters outside
iso-8859-1” is completely meaningless.

If you declare your page as UTF-8, you can have
any Unicode character (even PUA characters) in
your page.

Regardless of the charset declaration of your page,
you can include both Numeric Character References
and Character Entity References in your HTML source,
cf., e.g., http://www.w3.org/TR/html401/charset.html#h-5.3.
These may refer to any Unicode character, whatsoever.
However, they will take considerably more storage space
(and transmission bandwidth) than the UTF-8 encoded
characters would take.
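
The storage claim is easy to check. The following sketch compares the size of
one character in UTF-8 with the size of its numeric character references; the
function name `sizes` is made up for this illustration.

```python
def sizes(ch: str) -> dict[str, int]:
    """Octets needed for one character in UTF-8 and as decimal/hex NCR."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "decimal NCR": len(f"&#{ord(ch)};"),
        "hex NCR": len(f"&#x{ord(ch):X};"),
    }


u_umlaut = sizes("\u00fc")        # ü: in ISO 8859-1 a single octet
sinhala_letter = sizes("\u0d85")  # a Sinhala letter, outside ISO 8859-1
```

So ü grows from one octet (ISO 8859-1) to two (UTF-8) or six (reference), and
the Sinhala letter takes three octets in UTF-8 but seven as a reference.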

Good luck,
  Otto Stolz





Re: Zero-width joiner won't join

2012-03-06 Thread Otto Stolz

Hello,

on 2012-03-05 at 21:10, Philippe Verdy wrote:

For now it is
always unclear what is encoded in such HTML attachment, and we need to
use an external tool like od to see this information.


You can always open the attachment in an ordinary text editor,
such as Edlin in Windows.

With Edlin, Andreas Prilop’s example reads thusly:

<!DOCTYPE html>
<html>
<title>ZWJ</title>
<h1>
&#1607;&zwj; &zwj;&#1607;&zwj; &zwj;&#1607;</h1>
<h1 style="font-family: Tahoma">
&#1607;&zwj; &zwj;&#1607;&zwj; &zwj;&#1607;</h1>
<h1 style="font-family: Mangal">
&#1607;&zwj; &zwj;&#1607;&zwj; &zwj;&#1607;</h1>
<h1 style="font-family: Mangal, Tahoma">
&#1607;&zwj; &zwj;&#1607;&zwj; &zwj;&#1607;</h1>
</html>


I. e., U+0647, with ZWJ on either, or both, sides, in various
font families.
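
Instead of reading the references by eye, one can let a program resolve them.
The sketch below assumes the repeated line of the example consists of the
references &#1607; and &zwj; as written out in the string; the variable names
are made up.

```python
import html

# The heading text of the example, with its character references spelled out.
line = "&#1607;&zwj; &zwj;&#1607;&zwj; &zwj;&#1607;"

decoded = html.unescape(line)
code_points = [f"U+{ord(c):04X}" for c in decoded]
```

The resulting list shows three U+0647 separated by spaces (U+0020), with
U+200D on the inner sides, which is the sequence described above.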

Best regards,
  Otto Stolz




Re: Zero-width joiner won't join

2012-03-06 Thread Otto Stolz

Hello,

on 3/5/2012 10:25 AM, Andreas Prilop wrote:

I think the zero-width joiner (ZWJ, U+200D) should join
regardless of typeface. But Internet Explorer 8 won't join
if the ZWJ is taken from another font than surrounding text.


On 2012-03-05 at 23:49, Asmus Freytag wrote:

My question would be this: do you know of any browser that can handle
your test case?


Firefox 10.0.2 apparently does it right;
Opera 10.52 does a half-hearted job;
and Internet Explorer 8.0.6001.18702 completely screws it up.
Cf. attached screen shots.

Best wishes,
  Otto Stolz

attachment: ZWJ-Opera.png
attachment: ZWJ-Firefox.png
attachment: ZWJ-IE.png

Re: Upside Down Fu character

2012-01-04 Thread Otto Stolz

Hello,

on 2012-01-04 at 11:57, vanis...@boil.afraid.org wrote:

I tried out this code, it's simple html/CSS,

...

Specifically, the text 福<div class="txtUpsideDown">福</div>福
renders two right-side-up 福 characters on one line,
with the upside-down 福 on the next line.


I have tested the txtUpsideDown definition from
http://www.codeproject.com/Tips/166266/Making-Text-Upside-down-u
with three browsers:
 Firefox 8.0.1,
 Opera 10.52,
 Internet Explorer 8.0.6001.18702.
The latter asks for the user’s consent to interpret scripts,
before it applies the .txtUpsideDown class definition.
Cf. attached source file, and attached screen shot for the results.

Apparently, none of these browsers provides the appropriate
room for the text set upside down; hence, the txtUpsideDown
class should be amended to work around this shortcoming.
However, I have not found any e-mail address of its author,
“thatraja”, and I am not willing to sign on to codeproject.com,
Twitter, Facebook, or any other community, just to inform him.

As vanis...@boil.afraid.org already has observed,
the upside-down text either appears below the current line,
or overlaps other text in the current line. My test case
shows that, in the former case, the upside-down text will
overlap the following line (in this test, a horizontal rule).

Best wishes,
  Otto Stolz


attachment: Test_UpsideDown.png
Title: Test txtUpsideDown


福福福

福福

Van Anderson Van Anderson Van Anderson

Van Anderson Van Anderson


Sorting and German (was: Sorting and Volapük)

2012-01-01 Thread Otto Stolz

Happy New Year,

on Sun, Jan 1, 2012 at 4:27 PM,
Michael Everson ever...@evertype.com wrote:

Volapük sorts [...] ä a separate letter after a, ö separate after o,
and ü separate after u.
does anyone know if any other language treats ä/ö/ü in the same way?


On 2012-01-01 at 16:54, Peter Cyrus wrote:

German does both,


Not really.

According to DIN 5007,
German features two different sort orders:
• In lists of personal names, Ä, Ö, Ü may be sorted
  as AE, OE, and UE, respectively; this order is
  mainly used in telephone directories.
• In dictionaries and encyclopedias, Ä, Ö, Ü are sorted
  as A, O, and U, respectively.

As encyclopedias may well comprise personal names,
the scope of the former scheme is not well defined,
imho, and I stick to the latter one, whenever I have
to sort a list.

In both schemes, ß is sorted as SS.

In both schemes, true A (or AE, respectively) goes before Ä,
iff two sort keys are otherwise identical; likewise for
Ö, Ü, and ß.
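
The two schemes and the tie-breaking rule can be sketched as sort keys. This
is a simplified illustration, not a full DIN 5007 implementation: the names
`dictionary_key` and `phonebook_key` are made up, only ä, ö, ü and ß are
handled, and the tie-break simply compares the lower-cased original spelling.

```python
# Dictionary order: ä, ö, ü sort like a, o, u; ß like ss.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
# Name-list order: ä, ö, ü sort like ae, oe, ue; ß like ss.
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def dictionary_key(word: str) -> tuple[str, str]:
    """Primary key with umlauts folded to base letters, then a tie-break."""
    w = word.lower()
    return (w.translate(DICTIONARY), w)


def phonebook_key(word: str) -> tuple[str, str]:
    """Primary key with umlauts expanded to two letters, then a tie-break."""
    w = word.lower()
    return (w.translate(PHONEBOOK), w)


names = ["Mutter", "Müller", "Muller", "Mueller"]
by_dictionary = sorted(names, key=dictionary_key)
by_phonebook = sorted(names, key=phonebook_key)
```

In the dictionary scheme Müller lands directly after Muller; in the name-list
scheme it lands directly after Mueller. The second tuple element only matters
when the primary keys are identical, which puts the plain spelling first.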

In Austria, a third scheme is used in telephone directories
(but not in the yellow pages): Here, Ä, Ö, and Ü are
indeed treated as distinct letters, to go between A and B,
O and P, and U and V, respectively; and ß is treated as a
distinct pair of letters, to go between SS and ST.

Best wishes,
  Otto Stolz




Re: Fwd: Re: Samogitian E with dot above and macron

2010-10-27 Thread Otto Stolz

Hello,

by way of example, I had written:

Examples (from the keyboard driver I am currently using):
key “E” generates single character “e”;
key combination “⇧”+“E” generates single character “E”;
key “Ü” generates single character “ü”, but could as well generate the
canonically equivalent sequence “ü” (U+75, U+308);
key combination “AltGr”-“E” generates single character “€”
key sequence “^” + “a” generates single character “â”, but could as
well generate canonically equivalent “â” (U+61, U+302 )
key combination and sequence “AltGr”-“´” + “C” generates single
character “č”, but could as well … (you get the idea)


On Tue, 26 Oct 2010 11:10:16 +0200, Győző Dobner has asked:

Do you use the default keyboard driver of your operating system or
some third-party keyboard driver?


The latter, viz. “trans012” by Philipp Reichmuth. That keyboard driver
used to be available from (at least) two WWW sites, but no longer is.

Meanwhile, there is the “Europatastatur” http://www.europatastatur.de/
by Karl Pentzlin to suit the same purpose.


If you do use the default keyboard driver and it is this flexible,
then can you tell us what operating system you use


I am still using Windows XP SP 3.

Windows comes with a huge bag full of various keyboard drivers.
If no one of these happens to suit your needs, you can use third-party
software to define your own keyboard driver; however, I do know
next to nothing about this sort of software. At least, from the
examples given, it is evident that Windows’ keyboard interface,
and the keyboard-driver generating software, are flexible enough.
Some of the available software is listed under
http://www.unicode.org/resources/keyboards.html.

Pentzlin’s driver can generate almost any conceivable character (or
character sequence, respectively), if only it is used in a Latin-based
writing system. However, Pentzlin’s driver is mnemotechnically based
on the German standard keyboard layout; hence, if you use any layout
different from our QWERTZU layout, it will prove difficult to memorize
the various key assignments.


Your advice may be somewhat misleading if Arns uses something else.


As Arns had mentioned that he is pondering over his own, special
keyboard layout, I assume that he is well aware of the available
keyboard-driver generating software.

My point was just that the various keystrokes are not necessarily
mapped to single Unicode characters, and that keystroke sequences
are not necessarily mapped to character sequences. I do hope
that Arns can exploit this freedom of the mapping to find an
ergonomically, and linguistically, pleasing keyboard layout for
the purpose at hand.

Best wishes,
  Otto Stolz




Re: Fwd: Re: Samogitian E with dot above and macron

2010-10-26 Thread Otto Stolz

Hello Arns Udovīčė,

on 2010-10-26, you have written:

This asking for new letter in Unicode was for purpose to make normal
keyboard layout (even two variants) for my nation.


Note that the keyboard layout does _not_ depend on the availability of
composed letters in the target encoding (Unicode, in your case).

Rather, a keyboard driver can generate multiple Unicode characters for a
single keystroke, as well as a single or even multiple characters for a
sequence, or combination, of keystrokes. Examples (from the keyboard
driver I am currently using):

key “E” generates single character “e”;
key combination “⇧”+“E” generates single character “E”;
key “Ü” generates single character “ü”, but could as well generate the
canonically equivalent sequence “ü” (U+75, U+308);
key combination “AltGr”-“E” generates single character “€”
key sequence “^” + “a” generates single character “â”, but could as well
generate canonically equivalent “â” (U+61, U+302 )
key combination and sequence “AltGr”-“´” + “C” generates single
character “č”, but could as well … (you get the idea)

A decent Unicode-capable font is supposed to render the canonically
equivalent character sequences indistinguishably (if it is not a
special font designed to reveal the exact Unicode character sequence,
for debugging purposes). You can use the above examples of canonically
equivalent sequences to test your fonts.
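
To produce such test strings, or to check that two spellings are canonically
equivalent, Python's standard `unicodedata` module can be used. The helper
name `canonically_equivalent` is made up for this sketch.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Precomposed character and the equivalent base + combining mark sequence.
pairs = [
    ("\u00fc", "u\u0308"),  # ü = u + COMBINING DIAERESIS
    ("\u00e2", "a\u0302"),  # â = a + COMBINING CIRCUMFLEX ACCENT
    ("\u010d", "c\u030c"),  # č = c + COMBINING CARON
]
```

Each pair differs as a sequence of code points but compares equal after
normalization, so either member is a legitimate output for a keyboard driver.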

So, you could design your keyboard layout to best suit the writing
habits of your community – and you will have to find a decent font
to display the Unicode characters (and sequences thereof), according
to the rules of your orthography.

Best wishes,
  Otto Stolz





Re: ,,semi-virgula''

2010-08-31 Thread Otto Stolz

Hello,

On 2010-08-31 at 16:57, Janusz S. Bień wrote:

Can the diacritic be interpreted
as an already existing combining character?


Perhaps:
  0326 Combining comma below
  0329 Combining vertical line below
  0337 Combining short solidus overlay

Cheers,
  Otto Stolz



Re: Historical Scandinavian currency signs (daler and mark)

2010-08-12 Thread Otto Stolz

Hello,

on 2010-08-11 at 21:33, Johan Winge wrote:

http://forum.skalman.nu/download/file.php?id=19494
The top row (from a printed book) depicts signs for daler, mark and
skilling, respectively. As for skilling, I personally am content with
the Eszett, U+00DF, but I have not been able to figure out how to best
represent the other two with Unicode.



The Daler sign resembles closely the GERMAN PENNY SIGN (U+20B0).

Best wishes,
  Otto Stolz






Apostrophe in transliteration (was: Modifiers from punctuation)

2010-08-09 Thread Otto Stolz

Hello,

on 2010-08-08 at 18:56, António MARTINS-Tuválkin wrote:

We all know why is good to have U+02BC separated from U+2019,


Which one is recommended, when transliterating, as the Latin equivalent
of the Cyrillic letter Soft Sign (044C)?

Thanks for any hints,
  Otto Stolz




Re: number padless?

2010-08-07 Thread Otto Stolz

On 2010-08-07 at 04:19, Murray Sargent wrote:

In general to type in a character by its Unicode value,
type in the hex value and then alt+x.


In some MS programs, e. g. the German version of MS Word,
it’s rather Alt-C, as Alt-X is endowed with some other meaning.

Best wishes,
  Otto Stolz



Re: Bangladeshi

2010-07-07 Thread Otto Stolz

Hello,

on 2010-07-07 at 08:21, William J Poser wrote:

The principal language of Bangladesh is Bengali,
the same language and writing system as used in the Indian state of
West Bengal.


Cf.
http://www.unicode.org/faq/indic.html#14,
http://www.unicode.org/faq/indic.html#15,
http://www.unicode.org/faq/indic.html#18,
http://www.unicode.org/faq/indic.html#19,
http://www.unicode.org/faq/indic.html#20,
http://www.unicode.org/versions/Unicode5.2.0/ch09.pdf#G664195,
http://www.unicode.org/charts/PDF/U0980.pdf.

Tulasi, you should probably learn to peruse the FAQ, and other
information, provided on the Unicode WWW site:
• Start at http://www.unicode.org/faq/.
• To find a particular chapter in the current standard version,
  start at http://www.unicode.org/standard/standard.html,
  then follow the links to “Latest Version of the Standard”,
  then “*Unicode 5.2 Web Bookmarks*” (or some such, in the
  versions to come).
• To find particular character assignments, start at
  http://www.unicode.org/charts/.

Good luck,
  Otto Stolz




Re: Latin Script

2010-06-18 Thread Otto Stolz

Hello Tulasi,

on 2010-06-18 04:24, you have asked:

Or do Unicode & ISO/IEC use different number & name for same letter/symbol?


You might find enlightening the FAQ on “Unicode and ISO 10646”
http://www.unicode.org/faq/unicode_iso.html.

Best wishes,
  Otto Stolz




Re: Hexadecimal digits A-F

2010-06-09 Thread Otto Stolz

Hello,

on 2010-06-09 at 15:10, Frédéric Grosshans wrote:

I think adding the relevant few lines in the Archive of
Notices of Non-Approval http://www.unicode.org/alloc/nonapprovals.html
might be useful


Also an FAQ entry might be useful. I just have submitted a suggestion.

Best wishes,
  Otto Stolz




Re: Hexadecimal digits

2010-06-05 Thread Otto Stolz

On 2010-06-05 at 00:04, Luke-Jr wrote:

Base 16 is superior in many various ways, the most obvious being easier
division (both visibly and numeric).


This is a red herring, IMHO.

In the decimal system, you can more easily divide by 2, 5,
and powers of 10, whilst in the hexadekadic system,
you can more easily divide by many powers of two, and all
powers of 16.

For arbitrary divisors, the decimal system seems to be
easier, as you would use the same division algorithm
in both systems, however with different tables (dubbed
“multiplication table” or, less formally, “times table”)
that comprise 100 vs. 256 entries. Hence, the hexadekadic
multiplication table should be about 2½ times as hard
to learn, and memorize, as the decimal one.

Of course, a larger base needs fewer digits (on average)
for any given number; hence divisions of large numbers
tend to take fewer steps in the hexadekadic system than
in the decimal one; whether this will outweigh the larger
multiplication table to be used is, I reckon, a matter
of taste. Somewhere, there must be an optimum: I cannot
imagine people learning, and memorizing, e. g., the 3600
entries of the multiplication table for base 60.
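
The figures quoted above follow from the simple fact that a full
multiplication table for base b has b × b entries. A tiny sketch (the names
are made up) that also shows the opposing effect on the number of digits:

```python
def table_entries(base: int) -> int:
    """Entries in a full single-digit multiplication table for `base`."""
    return base * base


def digit_count(n: int, base: int) -> int:
    """Number of digits needed to write the positive integer n in `base`."""
    count = 0
    while n > 0:
        n //= base
        count += 1
    return count


ratio = table_entries(16) / table_entries(10)   # 256 / 100
```

So the table grows with the square of the base, while the digit count of a
given number shrinks only logarithmically, e. g. one million needs 7 decimal
digits but 5 hexadecimal ones.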

This whole deliberation is, of course, purely academic.
In real life, you will have to use the decimal system
as everybody else does, lest you be misunderstood.

You may wonder why I am using the term “hexadekadic”.
This is because “hexadeka” is the Greek word for 16,
whilst the Latin word is “sedecim”; there is no language
known that has “hexadecim”, or anything alike, for 16.

Best wishes,
  Otto Stolz




Re: Hexadecimal digits

2010-06-04 Thread Otto Stolz

Hello Luke-Jr,

you’ve been asking:

Are there any hexadecimal digits in Unicode?


Simply use the digits “0” through “9”, and the
letters “A” through “F”; cf.
http://www.unicode.org/faq/casemap_charprop.html#13.



For example, perhaps the digits used for John W. Nystrom's Tonal System?


I had to consult:
http://en.wikipedia.org/wiki/John_W._Nystrom#Tonal_System_.28Hexadecimal.29,
to learn about this system.

Apparently, Nystrom's Digits for “9” through “F” are
not encoded in Unicode,
cf. http://www.unicode.org/charts/#symbols.

I do not know, how successful Nystrom’s proposal has been,
and I cannot assess whether his digits deserve to be
encoded, in Unicode. If you think, these digits need
to be encoded, you are free to propose that; for the
procedure required,
cf. http://www.unicode.org/faq/char_proposal.html.

In any case, it would be problematic to unify Nystrom’s “9”
through “F” with Unicode “9”, “A” through
“F” (treating them as a glyph-variation, and font-selection,
issue), for two reasons:
• Unicode “A” through “F” are also used for spelling ordinary
  words; this would not be feasible with Nystrom’s glyphs;
• Nystrom’s digit “A” looks exactly like the common, decimal
  digit “9”, which would render any special Nystrom font
  rather misleading to the reader.


Best wishes,
  Otto Stolz



Re: Least used parts of BMP.

2010-06-04 Thread Otto Stolz

Hello,

On 2010-06-03 at 07:07, Kannan Goundan wrote:

This is currently what I do (I was referring to this as the compact
UTF-8-like encoding).  The one difference is that I put all the
marker bits in the first byte (instead of in the high bit of every
byte):
   0xxx
   10xx xyyy
   110x xxyy yzzz


The problem with this encoding is that the trailing bytes
are not clearly marked: they may start with any of
'0', '10', or '110'; only '111' would mark a byte
unambiguously as a trailing one.

In contrast, in UTF-8 every single byte carries a marker
that unambiguously marks it as either a single ASCII byte,
a starting byte, or a continuation byte; hence you do not have to
go back to the beginning of the whole data stream to recognize,
and decode, a group of bytes.
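
This self-synchronizing property can be shown in a few lines. The sketch
assumes well-formed UTF-8 input, and the names `classify` and `char_start`
are made up; octets that never occur in UTF-8 (C0, C1, F5..FF) are not
treated specially here.

```python
def classify(octet: int) -> str:
    """Tell the role of one octet of well-formed UTF-8 from its top bits."""
    if octet < 0x80:
        return "single"        # 0xxxxxxx: ASCII
    if octet < 0xC0:
        return "continuation"  # 10xxxxxx
    return "lead"              # 11xxxxxx: starts a multi-octet sequence


def char_start(data: bytes, index: int) -> int:
    """Find where the character containing data[index] begins."""
    while index > 0 and classify(data[index]) == "continuation":
        index -= 1             # at most three steps back in valid UTF-8
    return index


sample = "a\u4e8cb".encode("utf-8")   # 61 E4 BA 8C 62
```

Starting at any offset of `sample`, `char_start` steps back over at most a few
continuation octets; it never needs to rescan from the start of the stream.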

Best wishes,
  Otto Stolz





Re: Hexadecimal digits

2010-06-04 Thread Otto Stolz

Hello Luke-Jr,

please keep the discussion on the list.

I had written:

Simply use the digits “0” through “9”, and the
letters “A” through “F”;


You have written:

This makes it more complex to differentiate between numbers and
letters/units/etc.


In any case, you have to know the base of every number
you are going to parse. This stems from the fact that
the same digits are used for all number systems. Note
that Unicode is a character-encoding standard, hence
cannot do anything about this sort of ambiguity.
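
A two-line experiment shows that the base is information the parser must be
given and cannot read off the digits; the variable names are made up.

```python
# The same digit string denotes different numbers in different bases.
as_decimal = int("10", 10)
as_hexadecimal = int("10", 16)

# And a string of letters may be a word or a number, depending on context.
word_or_number = int("FACE", 16)
```

Here "10" parses as ten or as sixteen, and "FACE" is a perfectly good
hexadecimal numeral, so only the surrounding context can disambiguate.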

Best wishes,
  Otto Stolz



Re: Titlecasing iota subscript

2010-06-03 Thread Otto Stolz

Hello Karl Williamson,

you are asking:

But I don't understand the titlecasing part. Is it meant, when
titlecasing the base character the iota subscript is combined with? That
would also make sense, but then I would think it should move and become
a small letter iota instead, U+03B9.


Read the pertinent passage in the Unicode standard,
http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf#G29675.

Good luck,
  Otto Stolz




Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-13 Thread Otto Stolz

Mark Davis wrote:

This is just a confusion among the hoi polloi.


And here we have yet another example: “hoi” is Greek for “the”
(hoi polloi = the many).

Best wishes,
   Otto Stolz


Re: Marking Elongated script in scholarly editions

2004-12-01 Thread Otto Stolz

Hello,

Georg Vogeler wrote:

a triple St. Andrew's Cross (like an x).
As I didn't find anything like that at unicode.org I wonder if this
character - or something like it - is already part of the unicode standard?


Alas, U+3E1A has one cross too many ;-)

What do you mean by triple: How are those crosses aligned?

If they are written side-by-side, you could use three instances of
one of the following characters:
  U+00D7 Multiplication sign
  U+02DF Modifier letter cross accent

If they, however, are on top of each other, there seems
to be no way but using some hack, such as
  U+16DD U+0353

Or, perhaps, you could (ab)use asterisks, or dots instead, e. g.
  U+2042 Asterism (i. e. 3 asterisks, in a triangle shape)
  U+2051 Two asterisks aligned vertically
  U+22EE Vertical ellipsis

Best wishes,
   Otto Stolz


Re: CGJ , RLM

2004-11-29 Thread Otto Stolz
Hi,
Philippe Verdy had written:
For example, a ligaturing opportunity can be encoded explicitly
in the French word efficace: ef+ZWJ+f+ZWJ+icace. [...]
in French there's a possible hyphenation at the first occurrence,
where it is also a syllable break, but not for the second occurrence
that occurs in the middle of the second syllable.
Doug Ewell wrote:
a system that is capable of high-quality typography [...]
should generate ff-type ligatures and perform  sensible hyphenation by default.
 You can then use ZWNJ to turn ligation *off* where it is not desired.
In German, however, a ligature must not span a syllable break.
How should I code plain text, w.r.t. hyphenation and ligatures?
- Huf + ZWNJ + lattich
- Huf + SYH + lattich
- Huf + SYH + ZWNJ + lattich
- Huf + ZWNJ + SYH + lattich
Note that there is no algorithm to reliably derive the position of the
syllable break from the spelling of a word. You could even concoct pairs
of homographs that differ only in the position of the syllable break
(and, consequently, in their respective meaning). So far, I have only
found the somewhat silly example
- Brief+SYH+lasche (letter flap) vs.
- Brie+SYH+flasche (bottle to keep Brie cheese in),
but I am sure I could find better examples if I tried in earnest.
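
The homograph pair can be written down in Python as a sketch. It assumes that SYH stands for U+00AD SOFT HYPHEN; the helper visible() is made up for this illustration, and the last line shows just one of the four candidate codings from the list above, not a recommendation.

```python
SHY = "\u00AD"    # SOFT HYPHEN (assumed meaning of "SYH")
ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER

letter_flap = "Brief" + SHY + "lasche"
brie_bottle = "Brie" + SHY + "flasche"


def visible(text: str) -> str:
    """Drop the invisible format characters, leaving the printed letters."""
    return text.replace(SHY, "").replace(ZWNJ, "")


print(visible(letter_flap) == visible(brie_bottle))   # same letters on paper
print(letter_flap == brie_bottle)                     # different encoded strings

coltsfoot = "Huf" + SHY + ZWNJ + "lattich"            # one candidate coding
print([f"U+{ord(c):04X}" for c in coltsfoot[3:5]])
```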
Best wishes,
   Otto Stolz



Re: CGJ , RLM

2004-11-29 Thread Otto Stolz
Hello,
I had written:
Note that there is no algorithm to reliably derive the position of the
syllable break from the spelling of a word. You could even concoct pairs
of homographs that differ only in the position of the syllable break
(and, consequently, in their respective meaning). So far, I have only
found the somewhat silly example
- Brief+SYH+lasche (letter flap) vs.
- Brie+SYH+flasche (bottle to keep Brie cheese in),
but I am sure I could find better examples if I tried in earnest.
Peter Kirk schrieb:
Before our French members get upset at the idea that anyone might keep 
their famous cheese in bottles, let me remind the list of a similar pair 
we had before, although this affects only the less common st ligature:
Just because the st ligature is so uncommon (and the long ſ with its
t ligature is almost extinct), I was looking for an example involving
fl (or fi).
...
Wachs-tube (growth tube)
 (waxtube)
Best wishes,
   Otto Stolz



Re: Exporting Unicode UTF-8 from Word

2004-11-23 Thread Otto Stolz
Hello,
Cristian Secar wrote:
how do you export UTF-8 from MS Word? AFAIK, the only
way to do that is to copy from Word, paste to Notepad, then save as
UTF-8.
AFAIK, this works only under Windows XP.
It does definitely not work under Win 98/NT.
Alternatively, copy from Word, paste to e-mail recipient, whose
encoding is set to UTF-8.
This does indeed work under Windows 98/ME/NT/2k/XP; I do not remember,
whether it also worked under Windows 95.
Christopher Fynn schrieb:
If you have a Word document with Unicode characters
choose: File Save As, Save as type: Plain Text
enter a file name, and click on Save
This should bring up a File Conversion dialog [...]:
Text Encoding:  [...] Select: Other Encoding
...
This works in Word 2000, and up, but not in Word 97.
In Word 97, the only Unicode encoding available is UCS-2LE
(plain text via File/Save as, File type=Unicode text;
the Word format is also UCS-2LE encoded).
Best wishes,
   Otto Stolz



Re: utf-8 and unicode fonts on LINUX

2004-11-23 Thread Otto Stolz
Kefas,
you have written:
I tried UTF-8 export to send an e-mail that contained 
several scattered unicode codepoints from the full 
16-bit range from   to  from XP+Word to the 
university's Linux/Mozilla/OpenOffice/Kmail, enabled 
UTF-8 support. With very disappointing results.
For UTF-8 (or any other encoding except ISO 646 IRV
(aka ASCII)) to survive the transport via e-mail
(RFC 2821), it must be tagged and transfer-encoded,
according to RFC 2045, and RFC 2047. For examples, cf.
http://www.systems.uni-konstanz.de/EMAIL/FAQ-SMTP.php#74
(in German). It is the e-mail clients' responsibility
to do this tagging and encoding (on the sending side),
and the corresponding interpretation and decoding (on
the receiving side).
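
As an illustration of the tagging and transfer-encoding described above, here is a sketch using Python's standard email package. The addresses and the text are placeholders invented for this example; a real mail client does the equivalent internally.

```python
from email.message import EmailMessage

body = "Żółw, ə, ɛ: a few scattered non-ASCII characters.\n"

msg = EmailMessage()
msg["From"] = "sender@example.org"       # placeholder address
msg["To"] = "recipient@example.org"      # placeholder address
msg["Subject"] = "Grüße aus Konstanz"    # non-ASCII header text (RFC 2047)
# RFC 2045: label the body with its charset and pick a 7-bit-safe encoding.
msg.set_content(body, charset="utf-8", cte="quoted-printable")

wire = msg.as_string()                   # what would be handed to SMTP
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(wire)
```

The receiving client reverses both steps, using the charset label to decode the bytes.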
You have not mentioned, which e-mail client program
you have used, how it was configured, nor what the
result looked like. Hence, the cause of your very
disappointing results cannot be derived (nor even
guessed at).
1.  Do I expect too much assuming that UTF-8 just 
recodes the full 16-range in 8-bit but that 
text-programs with UTF-8 enabled should be able to 
reconstruct the full 16-bit range (as far as used)?
The Unicode range is much more than 16 bits (you need 21 bits
per character, but not all 21-bit values are used).
UTF-8 encodes every single character in 1 through 4 bytes;
cf. http://www.unicode.org/faq/utf_bom.html for more
details. I do not understand what you mean by reconstruct,
but I guess your question is answered in the
cited WWW page.
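
The byte counts can be checked with a short Python sketch; utf8_length is a name made up for this illustration, and the sample code points are arbitrary.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


print((0x10FFFF).bit_length())   # bits needed for the highest code point
for cp in (0x41, 0xDF, 0x20AC, 0x1D11E):
    # compare with what Python's built-in encoder produces
    print(f"U+{cp:04X}", utf8_length(cp), len(chr(cp).encode("utf-8")))
```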
Best wishes,
  Otto Stolz


Re: Unicode HTML, download

2004-11-22 Thread Otto Stolz
Hi,
E. Keown schrieb:
If I add the proper Unicode-related HTML code at the
top, will people get Unicode-compatible text when
they download this? 
First, you must make sure that your HTML source is stored
in Unicode (preferably: UTF-8) encoding, at all. E. g.,
in Windows XP, you can compose your page in the Notepad
editor, use the menu item File/Save as, and then choose
the UTF-8 encoding. In Linux, you could use the Yudit
editor.
Secondly, you must make sure that your HTML source is properly
transferred to your HTTP server. When it is encoded in UTF-8,
virtually any FTP, or SSH/SCP, client will do it correctly,
if you declare your source as text, or ASCII (as opposed
to Binary).
Thirdly, you must make sure that your HTTP server tells the
client (e. g. a browser) about the UTF-8 encoding. The usual
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML header is only one way to accomplish this task; it
depends on your server's settings whether this will work, at all.
Some HTTP servers require particular filename extensions
to signal the encoding, you may have to store your source
in a particular directory, the above-mentioned HTML line may
suffice, you may have to provide particular configuration
settings in an additional file (e. g. .htaccess), or it even
may be just impossible to serve UTF-8 files from your server;
ask your server's administrator for the actual conventions
applying in your environment. In any case, you should include
the above-mentioned line with your HTML header, so the page will
display correctly if no HTTP server is involved, e. g. locally
on your own workstation before you have uploaded the source.
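
The first and third points can be sketched in Python: store the source as UTF-8 bytes and label it as such. The file name, page content, and header string below are invented for this illustration; how a given HTTP server is made to send that header depends on its configuration, as said above.

```python
import tempfile
from pathlib import Path

page = (
    "<!DOCTYPE html>\n"
    "<html>\n<head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    "<title>Glossary</title>\n"
    "</head>\n"
    "<body><p>Grüße, żółw, שלום</p></body>\n"
    "</html>\n"
)

# Store the source as UTF-8 bytes (binary write, so nothing gets altered).
path = Path(tempfile.mkdtemp()) / "glossary.html"   # hypothetical file name
path.write_bytes(page.encode("utf-8"))

# The label a server should send along with those bytes.
http_header = "Content-Type: text/html; charset=utf-8"

raw = path.read_bytes()
print(http_header)
print(raw.decode("utf-8") == page)
```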
Is there a limit to how many separate writing systems
one can do this way?
This is not a question of how many, rather one of which ones.
The only limits are in capabilities of the browser your audience
is using (e. g. it may not be able to process RTL text), and in
the fonts available to said browser.
- In your HTML source, use only characters from the WGL4,
  cf. http://www.microsoft.com/OpenType/OTSpec/WGL4.htm;
  in your style sheet, ask for modern, WGL4-conforming fonts,
  cf. http://www.alanwood.net/unicode/fonts.html#wgl4.
- Test your page with several browsers, as recommended by other
  posters in this thread.
- I also recommend testing your page with the three W3C validators:
  http://validator.w3.org/, http://jigsaw.w3.org/css-validator,
  and http://validator.w3.org/checklink.
Good luck,
  Otto Stolz


Re: Unicode HTML, download

2004-11-22 Thread Otto Stolz
Hello,
I had written,
The only limits are in capabilities of the browser your audience
is using (e. g. it may not be able to process RTL text), and in
the fonts available to said browser.
- In your HTML source, use only characters from the WGL4,
  cf. http://www.microsoft.com/OpenType/OTSpec/WGL4.htm;
  in your style sheet, ask for modern, WGL4-conforming fonts,
  cf. http://www.alanwood.net/unicode/fonts.html#wgl4.
...
Peter Kirk has written:
This is not very helpful advice if you are wanting to put up a page in 
Hebrew (as Elaine explicitly is)
The original poster had mentioned a lot of languages, and for a moment,
I had forgotten about Hebrew being among them, so this advice (which I
have to give all too often, together with my other points) slipped in,
inadvertently. Of course, this advice is only valid for the text parts
written in European languages (which comprise the major part of the
list in the original poster); of course it does not apply to the Hebrew
column in Elaine's glossary. The URL for Hebrew fonts is
http://www.alanwood.net/unicode/fonts_windows.html#hebrew.
Peter Kirk has also rushed into the following conclusion:
In fact, what you seem to be saying is, only use European 
languages, and expect the rest of the world to learn them.
Not quite the attitude which Unicode is intended to promote.
In fact, Peter is grossly misinterpreting my statement and
he is imputing to me an attitude I do not maintain, and never
have.
Back to the gist of my advice for Elaine: It does not help much
to simply add the proper Unicode-related HTML code at the top;
rather, you have to make sure that your HTML code is encoded properly,
and that the reader's browser will know about its encoding.
Best wishes,
  Otto Stolz


Re: Unicode HTML, download

2004-11-22 Thread Otto Stolz
Hello,
I had written:
- In your HTML source, use only characters from the WGL4,
  cf. http://www.microsoft.com/OpenType/OTSpec/WGL4.htm; 
Peter Kirk has written:
Well, of course if you are using languages which use only characters 
from WGL4, you will use only these characters.
You can well write in a European language, and still use non-WGL4
characters, e. g.
- U+03D0 through U+03D6, U+03F0, U+03F1, which one might wish to use
  for Greek (the reference glyphs of U+03D1 and U+03F0 were the
  way I learned to write theta and kappa, respectively, in school),
- the whole Greek Extended block, which is still used by a school
  of contemporary Greek typography (though no more in the official
  orthography),
- U+2000 through U+200B: you have to use &nbsp;&nbsp;,
  and some such, to generate empty space,
- U+2010 -- believe it or not: WGL4 does not comprise the hyphen,
- some more General Punctuation, such as U+2012 (Figure Dash),
  U+2023 (Triangular Bullet), U+2027 (Hyphenation Point, used in
  dictionaries and glossaries, like Elaine's project), U+2032
  (Prime), U+2052 (Commercial Minus Sign),
- U+2070 through U+208E (Superscripts and Subscripts), except
  U+207F which is in WGL4 -- you will have to use HTML tags instead,
  which is more versatile, anyway,
- a major part of the vulgar fractions U+2153 through U+215F
  (1/3 is not in WGL4, but 3/8 is), and all Roman Numbers
  U+2160 through U+2183
More examples are left as an exercise to the gentle reader:
just compare http://www.microsoft.com/OpenType/OTSpec/WGL4E.HTM
to the pertinent ranges in THE BOOK.
When I wrote the advice quoted supra, I was mainly thinking of
some Punctuation and Symbols I'd like to use, which are not in WGL4.
Best wishes,
  Otto Stolz


Re: Unicode IDNs

2004-11-15 Thread Otto Stolz
Hello,
I had written:
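OS     | Browser        | http://www..net/ | http://w.pl/
-------+----------------+------------------+-------------
Win XP | Opera 7.54     | OK               | OK
SP2    +----------------+------------------+-------------
       | Netscape 6.2   | not found        | not found
       +----------------+------------------+-------------
       | IE 6.0 SP 2    | not found*       | not found*
       +----------------+------------------+-------------
       | Firefox 1.0    | OK               | OK
-------+----------------+------------------+-------------
Win 98 | Netscape 4.77  | not found        | not found
       +----------------+------------------+-------------
       | Netscape 6.2.2 | not found        | not found
       +----------------+------------------+-------------
       | IE 5 SP 2      | not found        | not found
-------+----------------+------------------+-------------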
 http://www.??.net/
 http://www.%C9%99%C9%9B.net/
 http://www./??.net/
 http://%c5%bc%c3%b3%c5%82w.pl/
* Not found on 1st try after IE start,
  IE 6 hung on subsequent tries
Peter Kirk wrote:
The problem was that the browser was looking for http://www.%c9%99%c9%9b.net/
rather than http://www..net/, in other words exactly the same problem as Otto
found with Netscape 6.2.   The clipboard contained http://www.%c9%99%c9%9b.net/,
as both Unicode text and basic plain text. There was obviously a problem in how
Mozilla copied this address to the clipboard. 
In my tests, however, the address http://www..net/ was pasted correctly
into the respective browsers address field; only after attempting to
fetch the page, the distorted address appeared, either in the address
field or in the error message (depending on the browser).
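
The distorted addresses are percent-encoded UTF-8 bytes of the original host names, which can be verified with Python's standard library. This is only a sketch; the last step uses Python's built-in "idna" codec (IDNA 2003) to show the kind of ASCII form an IDN-aware browser would look up instead.

```python
from urllib.parse import unquote

first_host = unquote("www.%C9%99%C9%9B.net")
second_host = unquote("%c5%bc%c3%b3%c5%82w.pl")
print(first_host)    # www.əɛ.net
print(second_host)   # żółw.pl

# An IDN-aware client converts each label to an ASCII-compatible form
# before the DNS lookup, rather than percent-encoding it.
ace = second_host.encode("idna")
print(ace)
```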
Best wishes,
  Otto Stolz



Re: Unicode IDNs

2004-11-11 Thread Otto Stolz
More Browsers tested:
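OS     | Browser        | http://www..net/ | http://w.pl/
-------+----------------+------------------+-------------
Win XP | Opera 7.54     | OK               | OK
SP2    +----------------+------------------+-------------
       | Netscape 6.2   | not found        | not found
       +----------------+------------------+-------------
       | IE 6.0 SP 2    | not found*       | not found*
       +----------------+------------------+-------------
       | Firefox 1.0    | OK               | OK
-------+----------------+------------------+-------------
Win 98 | Netscape 4.77  | not found        | not found
       +----------------+------------------+-------------
       | Netscape 6.2.2 | not found        | not found
       +----------------+------------------+-------------
       | IE 5 SP 2      | not found        | not found
-------+----------------+------------------+-------------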
 http://www.??.net/
 http://www.%C9%99%C9%9B.net/
 http://www./??.net/
 http://%c5%bc%c3%b3%c5%82w.pl/
* Not found on 1st try after IE start,
  IE 6 hung on subsequent tries
Best wishes,
   Otto Stolz








Re: Errors in TUS Figure 15.2?

2004-07-30 Thread Otto Stolz
Peter Kirk schrieb:
There appear to be two errors (not listed in the errata page 
http://www.unicode.org/errata/) in Figure 15.2 on page 391 of The 
Unicode Standard 4.0, the online version at 
http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf.

The fourth and last column of the table appears to be the same as the 
third column, except for the header row and the first content row 
referring to the fi ligature. But the forms in the second and third rows 
seem to be incorrect. (The forms in the fourth row should be the same.)

The fourth column is supposed to indicate the desired rendering of C1, 
ZWJ, C2. But in the text just before, ZWJ is specified as follows:

ZWJ requests that glyphs in the highest available category (for the 
given font) be used:
1. Unconnected
2. Cursively connected
3. Ligated
Read the paragraph immediately below that figure.
Cheers,
  OS


Re: UTF Magic Pocket Encoders

2004-07-09 Thread Otto Stolz
Hello,
I had written:
While not being ASCII proper, these MPEs use only characters
that were already present in CP 437 (the original PC code).
Doug Ewell wrote:
Unfortunately, neither the proper apostrophe ’ nor the copyright © and
trademark ™ symbols appear in CP437.
Sorry for my sloppy wording. I was basically referring to characters
that make the MPEs work, viz. the arrows and box drawing characters.
I had completely forgotten the other three symbols which only are
meant to make the appearance more pleasant. (The copyright © symbol
came into my UTF-16 MPE to fill an ugly gap, and the trademark ™
symbol followed suit, as a little extra gag.)
I had also written:
I hope this will end the discussion on MPEs, which are toys,
after all (though they could also be used to visualize the
three UTF encodings).
Doug Ewell wrote:
Unicode toys are not always a bad thing.  They can be an aid to
understanding,
This is what I meant by visualization. I am planning
to use them in a tutorial on Unicode I have promised to
present to some of my colleagues.
Mike Ayers wrote:
I have used Marco's encoder to decode `od -t x1` streams, and keep
the encoders printed, glued, and handy for emergencies.
I will consider this approach for my own work ;-)
Normally, I use Notepad, and Command Debug, on my XP system
for quick conversions.
Best wishes,
  Otto Stolz



UTF Magic Pocket Encoders

2004-07-08 Thread Otto Stolz
Hello,
Dominikus Scherkl (MGW) wrote about Cima's UTF-8 Magic Pocket Encoder:
Oha?
Updated without changing version and date?
;-)
I had provided a Magic Pocket Encoder for UTF-16, and afterwards
have been made aware of some spelling, and wording, errors.
Mike Ayers has contributed the crowning achievement: his
UTF-32 Magic Pocket Encoder. This one is already perfect,
hence it will probably never reach version 1.1 :-)
Attached, you'll find the current versions of all three,
in a somewhat enhanced typography: I have exploited box-drawing
characters, arrows, and proper (typographical) apostrophes.
While not being ASCII proper, these MPEs use only characters
that were already present in CP 437 (the original PC code).
I haven't changed the wording, of course, except the version
number and date, and the reference to arrows (rather than
exclamation points), as appropriate.
I hope this will end the discussion on MPEs, which are toys,
after all (though they could also be used to visualize the
three UTF encodings).
Cheers,
  Otto Stolz
Side 1 (print and cut out):

╔════════════╦═══════╦═══════════════════════╦══════╗
║     U+0000 ║ yy zz ║    Cima’s UTF-8 Magic ║ Hex↔ ║
║     U+007F ║ ↓  ↓  ║        Pocket Encoder ║ B-4  ║
║         YZ ║ _  _  ║                       ║      ║
╟────────────╫───────╚═══════╗     Vers. 1.1 ║ 0↔00 ║
║     U+0080 ║ 3x xy │ 2y zz ║    2004-06-30 ║ 1↔01 ║
║     U+07FF ║ 3_ __ │ 2_ ↓  ║               ║ 2↔02 ║
║        XYZ ║ _  _  │ _  _  ║          M.C. ║ 3↔03 ║
╟────────────╫───────┼───────╚═══════╗       ║ 4↔10 ║
║     U+0800 ║ 32 ww │ 2x xy │ 2y zz ║       ║ 5↔11 ║
║     U+FFFF ║ ↓  ↓  │ 2_ __ │ 2_ ↓  ║       ║ 6↔12 ║
║       WXYZ ║ E  _  │ _  _  │ _  _  ║       ║ 7↔13 ║
╟────────────╫───────┼───────┼───────╚═══════╣ 8↔20 ║
║ U-00010000 ║ 33 0v │ 2v ww │ 2x xy │ 2y zz ║ 9↔21 ║
║ U-000FFFFF ║ ↓  0_ │ 2_ ↓  │ 2_ __ │ 2_ ↓  ║ A↔22 ║
║      VWXYZ ║ F  _  │ _  _  │ _  _  │ _  _  ║ B↔23 ║
╟────────────╫───────┼───────┼───────┼───────╢ C↔30 ║
║ U-00100000 ║ 33 10 │ 20 ww │ 2x xy │ 2y zz ║ D↔31 ║
║ U-0010FFFF ║ ↓  ↓  │ ↓  ↓  │ 2_ __ │ 2_ ↓  ║ E↔32 ║
║       WXYZ ║ F  4  │ 8  _  │ _  _  │ _  _  ║ F↔33 ║
╚════════════╩═══════╧═══════╧═══════╧═══════╩══════╝

Side 2 (print, cut out, and glue on back of side 1):

╔═══════════════════════════════════════════════════╗
║ Cima’s UTF-8 Magic Pocket Encoder - User’s Manual ║
║ (vers. 1.1, 2004-06-30, by Marco Cimarosti)       ║
║                                                   ║
║ - Left column: min and max Unicode scalar values: ║
║   pick the row that applies to the code point you ║
║   want to convert to UTF-8. Letters V..Z mark the ║
║   hexadecimal digits that have to be processed.   ║
║ - Right column: hexadecimal to base-4 table.      ║
║ - Central columns: work area to compute each octet║
║   (1 to 4) that constitute UTF-8 octet sequences. ║
║     Convert each digit marked by V..Z from hex. to║
║ b.-4. Write b.-4 digits on the dots placed under  ║
║ letters v..z (two b.-4 digits per hex. digit).    ║
║ Convert 2-digit base-4 number to hex. digits and  ║
║ write them on the dots on the line. That is your  ║
║ UTF-8 sequence in hex.     ↓ Downwards arrows show║
║ passages that may be skipped, either because the  ║
║ digit is hard-coded, or because it may be copied  ║
║ directly from the scalar value.                   ║
╚═══════════════════════════════════════════════════╝

Obverse: Print with a fixed-width font, such as Lucida Console,
and cut out.

╔════════════╦═════════════╦═════════════════════════════════╗
║     U+0000 ║ W  X  Y  Z  ║ Otto’s Magic Pocket Encoder     ║
║     U+D7FF ║ ↓  ↓  ↓  ↓  ║ for  UTF-16™╔═══════════════════╣
║       WXYZ ║ _  _  _  _  ║             ║    V→vv │    V→vv ║
╟────────────╫─────────────╢ Version 1.1 ║    U→uu │    U→uu ║
║     U+E000 ║ W  X  Y  Z  ║ ©2004-07-05 ║ tt←T    │ tt←T    ║
║     U+FFFF ║ ↓  ↓  ↓  ↓  ║             ║    _←__ │    _←__ ║
║       WXYZ ║ _  _  _  _  ║             ║ ────────┼──────── ║
╟────────────╫─────────────╚═════════════╣    0↔00 │ 13←8↔20 ║
║ U-00010000 ║ 31 2t tu uv │ 31 3v Y  Z  ║ 00←1↔01 │ 20←9↔21 ║
║ U-000FFFFF ║ ↓  2_ __ __ │ ↓  3_ ↓  ↓  ║ 01←2↔02 │ 21←A↔22 ║
║      TUVYZ ║ D  _  _  _  │ D  _  _  _  ║ 02←3↔03 │ 22←B↔23 ║
╟────────────╫─────────────┼─────────────╢ 03←4↔10 │ 23←C↔30 ║
║ U-00100000 ║ 31 23 3u uv │ 31 3v Y  Z  ║ 10←5↔11 │ 30←D↔31 ║
║ U-0010FFFF ║ ↓  ↓  3_ __ │ ↓  3_ ↓  ↓  ║ 11←6↔12 │ 31←E↔32 ║
║       UVYZ ║ D  B  _  _  │ D  _  _  _  ║ 12←7↔13 │ 32←F↔33 ║
╚════════════╩═════════════╧═════════════╩═══════════════════╝


....:....1....:....2....:....3....:....4....:....5....:....6..


Reverse: Cut out and paste on back of obverse.

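╔════════════════════════════════════════════════════════════╗
║     Otto’s Magic Pocket Encoder for UTF-16 Version 1.1     ║
║     User’s Manual     (inspired from Cima’s UTF-8 MPE)     ║
╠════════════════════════════════════════════════════════════╣
║• Left column: min and max Unicode scalar values: pick the  ║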
║  row

Re: UTF to unicode conversion

2004-07-02 Thread Otto Stolz
Hello,
Mike Ayers has written:
Who said that Unicode is high-tech?
Here is a device to generate UTF-8 that employs traditional tools such 
as ASCII art, paper, scissors, glue, brain.
Attached is a similar device for converting Unicode scalar values
to UTF-16 (UTF-16BE, that is, but you could easily add a final step
to compute UTF-16LE, or to add a BOM).
Definitely, the world has longed for this, for years ;-)  Enjoy!
Cheers,
  Otto Stolz
Avers: Print with a fixed-width font, such as Lucida Console,
and cut out.

╔════════════╦═════════════╦═════════════════════════════════╗
║     U+0000 ║ W  X  Y  Z  ║ Otto’s Magic Pocket Encoder     ║
║     U+D7FF ║ ↓  ↓  ↓  ↓  ║ for  UTF-16™╔═══════════════════╣
║       WXYZ ║ _  _  _  _  ║             ║    V→vv │    V→vv ║
╟────────────╫─────────────╢ Version 1.0 ║    U→uu │    U→uu ║
║     U+E000 ║ W  X  Y  Z  ║ ©2004-07-02 ║ tt←T    │ tt←T    ║
║     U+FFFF ║ ↓  ↓  ↓  ↓  ║             ║    _←__ │    _←__ ║
║       WXYZ ║ _  _  _  _  ║             ║ ────────┼──────── ║
╟────────────╫─────────────╚═════════════╣    0↔00 │ 13←8↔20 ║
║ U-00010000 ║ 31 2t tu uv │ 31 3v Y  Z  ║ 00←1↔01 │ 20←9↔21 ║
║ U-000FFFFF ║ ↓  2_ __ __ │ ↓  3_ ↓  ↓  ║ 01←2↔02 │ 21←A↔22 ║
║      TUVYZ ║ D  _  _  _  │ D  _  _  _  ║ 02←3↔03 │ 22←B↔23 ║
╟────────────╫─────────────┼─────────────╢ 03←4↔10 │ 23←C↔30 ║
║ U-00100000 ║ 31 23 3u uv │ 31 3v Y  Z  ║ 10←5↔11 │ 30←D↔31 ║
║ U-0010FFFF ║ ↓  ↓  3_ __ │ ↓  3_ ↓  ↓  ║ 11←6↔12 │ 31←E↔32 ║
║       UVYZ ║ D  B  _  _  │ D  _  _  _  ║ 12←7↔13 │ 32←F↔33 ║
╚════════════╩═════════════╧═════════════╩═══════════════════╝

....:....1....:....2....:....3....:....4....:....5....:....6..

Revers: Cut out and paste on back of avers.

╔════════════════════════════════════════════════════════════╗
║     Otto’s Magic Pocket Encoder for UTF-16 Version 1.0     ║
║     User’s Manual     (inspired from Cima’s UTF-8 MPE)     ║
╠════════════════════════════════════════════════════════════╣
║• Left column: min and max Unicode scalar values: pick the  ║
║  row that applies to the code point to be converted.       ║
║  T…Z mark the hexadecadic digits that have to be processed.║
║• Central column: work area to compute UTF-16BE code units. ║
║• Right column: hexadecadic to quaternary conversion tables:║
║  ← for T to tt; ↔ for U/V to uu/VV (step 1) and for step 2.║
║1. Convert each digit marked by T…V from hex to quat. Write ║
║   quat digits on the underscores placed under letters t…v. ║
║2. Convert 2-digit quat numbers to hex digits or copy digits║
║   W…Z, as indicated, and write them  on the underscores on ║
║   the next line. That’s your UTF-16BE sequence in hex.     ║
║↓ Downwards arrows sh

Long S in Germany (was: 0364 COMBINING LATIN SMALL LETTER E)

2004-01-08 Thread Otto Stolz
Hello, and best wishes for the new year.

Gerd Schumacher wrote:
The long s [...] has been abandoned from the Roman alphabet in Germany
in the mid of the 19th century.
You mean the 20th century, don't you?

I have a facsimile reprint of the 1914 issue of Zupfgeigenhansel
(a renowned song-book), which is set in Roman type (Antiqua, in
German) and uses the long-S consistently, according to German
orthographic rules.
If I am not mistaken, both Roman type (Antiqua) and Gothic type
(Fraktur) were used concurrently up to 1941 when Gothic type was
banned by the government; likewise, Latin and German handwriting
were used concurrently. (In the 1930s, the very same government
had pushed the usage of Gothic type.)
Usually there is no ß on Swiss typewriters, because in Swiss pronunciation
there are many syllable-boundaries between the two s-parts of the common
German ß.
There is no syllable-boundary within the ß, as it signifies a single
sound. I can think of no example where the Swiss pronunciation is
different, in this respect.
In compounds, two s characters from different constituents may
happen to stand side by side, as in aussprechen (from aus-
sprechen), but these are never ever replaced with an ß letter.
The rationale for the Swiss keyboard design is that the accented
characters (for French and Italian) were less dispensable than the
ß (only used in German, and easily replaced with the ss Digraph).
Again, best wishes,
   Otto Stolz



Re: Swastika to be banned by Microsoft?

2003-12-15 Thread Otto Stolz
[EMAIL PROTECTED] wrote:
It is said that one who ignores history is doomed to repeat it.
George Santayana (1863-1952) actually said: Those who cannot remember 
the past are condemned to repeat it. (from Life of Reason I).

Or, we might consider that the same characters used to represent 
holy books or love poetry can also render 'Mein Kampf'.
You cannot devise an alphabet incapable of spelling swear-words.

Cheers,
  Otto Stolz



Re: Tamazight/berber language : How to send mail, write word documents ....

2003-06-06 Thread Otto Stolz
Azzedine Ait Khelifa wrote:

how i can send 
mail, write word document using Arial Unicode MS font ?
Cf. http://www.alanwood.net/unicode/.

After having read, tested, and understood (at large) these pages,
try and ask more specific questions on the Unicode list.
Good luck,
  Otto Stolz



Re: Unicode-compliant email manager on XP system

2003-06-03 Thread Otto Stolz
Karljürgen Feuerherm wrote:

I've been using Outlook XP on Win XP to manage email, but have had
intermittent difficulty in sending attachments, which seem sometimes to not
arrive, sometimes to arrive corrupted, and sometimes to arrive without
corruption. I seem to think it used to work, but then became corrupted
(though that might well be an illusion); but uninstal and reinstal has not
solved the problem. So


Rick McGowan wrote:
 The main problem with Outlook that I think Karljuergen is referring to
 has   to do with those darned winmail.dat blobs. [...] Can't that be
 turned off??
Cf. http://support.microsoft.com/default.aspx?scid=KB;en-us;q241538.

Karljürgen Feuerherm continued:
I'm looking for some other product which might suit the purpose, either free
or at least not expensive.


Mozilla 1.3 comprises a very reasonable e-mail client.

Best wishes,
  Otto Stolz



Re: Unicode-compliant email manager on XP system

2003-06-03 Thread Otto Stolz
Christopher John Fynn wrote:

Using Outlook or MS Word to send rich email results in
winmail.dat blobs being sent - but Outlook Express does not
seem to generate these as it uses HTML and multi-part MIME.


Outlook Express does not even understand them. TNEF is
confined to the Office 2000 arena.
Again, I recommend reading
http://support.microsoft.com/default.aspx?scid=KB;en-us;q241538.
Best wishes,
  Otto Stolz



Specifying the character encoding (was: Announcement: New Unicode Savvy Logo)

2003-05-31 Thread Otto Stolz
William Overington wrote:

1.  I tried out the validation procedure on the following page.
http://www.users.globalnet.co.uk/~ngo/font7007.htm
It will
not validate.  It is not clear to me what I need to add to the page to get
it to validate.


RTFM:
http://www.w3.org/TR/html401/struct/global.html#h-7.2
and http://www.w3.org/TR/html401/charset.html#h-5.2.2.
Cheers,
  Otto Stolz



Re: Re: book end or enclosing characters in most languages?

2003-05-29 Thread Otto Stolz
Doug Ewell wrote:

The actual characters used for these purposes vary, not only by script
but also by language and even country,


E. g., the very same character,
U+201C LEFT DOUBLE QUOTATION MARK, is used
- as opening-quote mark, in English (USA),
- as closing-quote mark, in German (DE).
So, no comprehensive list of openers vs. closers is possible.

Best wishes,
   Otto Stolz
PS. In these two languages, the quote-marks are paired thusly:
 en_US: U+201C ... U+201D,  and U+2018 ... U+2019
 de_DE: U+201E ... U+201C,  and U+201A ... U+2018
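
The pairing can be expressed as a small locale-keyed table in Python. This is a sketch only: QUOTE_PAIRS, quote() and roles() are names made up for this illustration, and the table holds just the two locales listed above.

```python
# (opening, closing) for the outer and the nested level, per locale
QUOTE_PAIRS = {
    "en_US": (("\u201C", "\u201D"), ("\u2018", "\u2019")),
    "de_DE": (("\u201E", "\u201C"), ("\u201A", "\u2018")),
}


def quote(text: str, locale_name: str, level: int = 0) -> str:
    """Wrap text in the quotation marks of the given locale and nesting level."""
    opening, closing = QUOTE_PAIRS[locale_name][level]
    return opening + text + closing


def roles(char: str) -> set:
    """All (locale, role) combinations in which a character occurs."""
    found = set()
    for locale_name, pairs in QUOTE_PAIRS.items():
        for opening, closing in pairs:
            if char == opening:
                found.add((locale_name, "opening"))
            if char == closing:
                found.add((locale_name, "closing"))
    return found


print(quote("Unicode", "en_US"), quote("Unicode", "de_DE"))
print(sorted(roles("\u201C")))   # U+201C opens in en_US but closes in de_DE
```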



Re: Detecting UTF-8 Locale Question

2003-03-26 Thread Otto Stolz
Edward H Trager wrote:

(1) Is examination of the LC_CTYPE environment variable on UNIX-like
environments a sufficient way of detecting locale?
No, see http://docs.sun.com/db/doc/806-0627/6j9vhfn5f?a=view.

See also:
- http://docs.sun.com/db/doc/806-0634/6j9vo5am0?a=view
- http://docs.sun.com/db/doc/806-0634/6j9vo5akm?a=view
- http://docs.sun.com/db/doc/806-0624/6j9vek59d?a=view
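
One reason why LC_CTYPE alone is not sufficient is the POSIX precedence rule: LC_ALL overrides LC_CTYPE, which in turn overrides LANG. A minimal Python sketch of that lookup follows; the function names are made up for this illustration, and the codeset test is a rough string heuristic, not a substitute for asking the C library (setlocale followed by nl_langinfo(CODESET)).

```python
import os


def effective_ctype_locale(environ=os.environ) -> str:
    """Locale name that governs character classification, per POSIX precedence."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:                 # unset and empty both mean "not specified"
            return value
    return "C"


def looks_like_utf8(locale_name: str) -> bool:
    """Heuristic: does the codeset part of e.g. 'de_DE.UTF-8@euro' say UTF-8?"""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


sample = {"LANG": "en_US.ISO8859-1", "LC_CTYPE": "de_DE.UTF-8"}
print(effective_ctype_locale(sample), looks_like_utf8(effective_ctype_locale(sample)))
```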
Best wishes,
  Otto Stolz



Re: New document.

2003-03-17 Thread Otto Stolz
Yung-Fong Tang wrote:

could you point out which symbol in that two images need to be proposed?
either by using a red circle on the image or tell us the surrounding text.
Thanks


http://www.rz.uni-konstanz.de/Antivirus/tests/genealog/




Re: New document.

2003-03-14 Thread Otto Stolz
Dominikus Scherkl had written:
 Has anybody meanwhile contributed a proposal
 regarding the missing genealogical symbols ?
 (after Otto Stolz's message from 24.02.2003 I wondered this
 was not proposed long ago - or was it and was rejected for
 some reason?!?).
So do I -- however, I am not in the position to author
a formal proposition: I am not able to provide a font,
and I do not have the time for a research beyond the
evidence I have already given.
Rick McGowan wrote:

Do you have any documentation on these?


The two scans under
  http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
  http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
are from the authoritative (until July 1996) book on German
orthography: Duden Rechtschreibung der deutschen Sprache
und der Fremdwörter / hrsg. von d. Dudenred. auf d. Grundlage
d. amtl. Rechtschreibregeln. [Red.Bearb.: Werner Scholze-
Stubenrecht unter Mitw. von Dieter Berger ...]. - 19., neu bearb.
u. erw. Aufl. ISBN: 3-411-20900-3.
Best wishes,
  Otto Stolz



Re: Need encoding conversion routines

2003-03-14 Thread Otto Stolz
askq1 askq1 wrote:

Actually my requirement is straightforward/common and I believe it 
should be available somewhere on net.
In particular I need source code (or some way) for following requirements:
- Convert Unicode code-point to UTF8 encoding and vice-versa.
- Convert Unicode code-point to UCS2 encoding and vice-versa.
- Convert Unicode code-point to UTF16 encoding and vice-versa.
http://czyborra.com/utf/
ftp://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c
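
For orientation, here is what such conversions amount to, as a Python sketch rather than the C code behind the links above. The function names are made up for this illustration; the results are compared against Python's built-in codecs, which also cover the reverse direction.

```python
def check_scalar(cp: int) -> None:
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")


def to_utf8(cp: int) -> bytes:
    check_scalar(cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


def to_utf16(cp: int) -> list:
    """One or two 16-bit code units."""
    check_scalar(cp)
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                       # 20 bits, split 10 + 10
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def to_ucs2(cp: int) -> int:
    """UCS-2 has no surrogate pairs, so only the BMP is representable."""
    check_scalar(cp)
    if cp > 0xFFFF:
        raise ValueError("not representable in UCS-2")
    return cp


for cp in (0x41, 0xDF, 0x20AC, 0x1D11E):
    units = to_utf16(cp)
    print(f"U+{cp:04X}", to_utf8(cp).hex(), [f"{u:04X}" for u in units])
```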
Cheers,
  OS



Re: ISO 8859_2 and Windows 1250

2003-03-13 Thread Otto Stolz
I had written:
CP 1250 contains the ISO 8859-1 characters, hence it is not
suited for Slavic languages. 


Eric Muller wrote:

I suspect that Otto meant to type CP 1252 contains...


Of course.  Thanks for the correction.

Cheers,
  OS



Re: ISO 8859_2 and Windows 1250

2003-03-12 Thread Otto Stolz
SRIDHARAN Aravind wrote:

What is the basic difference between ISO 8859_2 and Windows 1250?


CP 1250 contains all ISO 8859-2 characters, but some of them
in different code positions, plus about two dozen characters,
mostly from Unicode's General Punctuation range.
Cf. http://czyborra.com/charsets/iso8859.html#ISO-8859-2
and http://czyborra.com/charsets/codepages.html#CP1250.
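
The relation between the two code pages can be checked with Python's built-in codecs. This is a sketch; repertoire() is a helper made up for this illustration, and only the graphic range 0xA0 through 0xFF of ISO 8859-2 is considered.

```python
def repertoire(codec: str) -> dict:
    """Map each character in the upper half of a single-byte codec to its byte."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue              # byte is undefined in this code page
        table[char] = byte
    return table


latin2 = {c: b for c, b in repertoire("iso8859_2").items() if b >= 0xA0}
cp1250 = repertoire("cp1250")

missing = set(latin2) - set(cp1250)                      # ISO 8859-2 only
moved = {c for c in latin2 if c in cp1250 and cp1250[c] != latin2[c]}
extra = {c for c in cp1250 if c not in latin2}           # CP 1250 only

print("missing from CP 1250:", sorted(missing))
print("moved:", "".join(sorted(moved)))
print("additional in CP 1250:", "".join(sorted(extra)))
```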
Also what is the basic difference between Cp1250 and Cp1252.


CP 1250 contains the ISO 8859-1 characters, hence it is not
suited for Slavic languages.
Cf. http://czyborra.com/charsets/iso8859.html#ISO-8859-2
and http://czyborra.com/charsets/iso8859.html#ISO-8859-1.
What should I do for proper display of data in browser

for [Polish and Czech]?


1. Encode your text in either ISO 8859-2 or UTF-8.
   Both of these are proper ASCII supersets, so there is no
   problem for your HTML, or CSS, tags.
2. Label your HTML pages properly with the encoding used,
   cf. http://www.w3.org/TR/html401/charset.html#h-5.2.2.
3. Make sure that your HTTP server does not contradict your
   charset label in the HTTP headers it will garnish your
   HTML code with. In case you are using Apache 1.3, cf.
   http://httpd.apache.org/docs/mod/core.html#adddefaultcharset,
   http://httpd.apache.org/docs/mod/mod_mime.html#addcharset,
   http://httpd.apache.org/docs/configuring.html, and
   http://httpd.apache.org/docs/configuring.html#htaccess.
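
The three steps above can be sketched as a small self-check in Python. This is only an illustration under stated assumptions: the sample page, the function names, and the simplistic regular expression for the META element are all made up here; a real check would parse the HTML properly and fetch the actual HTTP headers from the server.

```python
import re
from email.message import Message


def header_charset(content_type):
    """Charset parameter of an HTTP Content-Type value, lower-cased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(html_bytes):
    """Charset announced near the top of the page itself, lower-cased, or None."""
    head = html_bytes[:1024]  # the label is ASCII and must come early
    match = re.search(rb"charset\s*=\s*[\"']?([\w.:-]+)", head, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


def consistent(content_type, html_bytes):
    """True unless the HTTP header and the page label name different charsets."""
    from_header = header_charset(content_type)
    from_page = meta_charset(html_bytes)
    return from_header is None or from_page is None or from_header == from_page


# Steps 1 and 2: a Polish sample page, encoded in ISO 8859-2 and labelled as such.
page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n'
    "<html><head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">\n'
    "<title>Zażółć gęślą jaźń</title>\n"
    "</head><body><p>Zażółć gęślą jaźń</p></body></html>\n"
).encode("iso8859_2")

# Step 3: compare with what the server claims.
print(consistent("text/html; charset=ISO-8859-2", page))  # agreeing header
print(consistent("text/html; charset=ISO-8859-1", page))  # contradicting header
```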
Best wishes,
  Otto Stolz



Re: Encoding: Unicode Quarterly Newsletter

2003-03-11 Thread Otto Stolz
Kenneth Whistler wrote:

we can
calculate the weight as being *approximately* 9.05 pounds
(avoirdupois) [or 10.99 troy pounds].
Apparently a weighty publication, that forthcoming Unicode standard...

Cheers,
  Otto Stolz



Re: Encoding: Unicode Quarterly Newsletter

2003-03-11 Thread Otto Stolz
Marco Cimarosti wrote:

the mass or weight
of a book do not change depending on whether someone is reading it or not.
Consequently, the same weight corrections need to be applied also if someone
*throws* the standard in a deep cave.

Beware: When the book is thrown at high speed, the relativistic
effects must be taken into account. I hope that the editors took
pains to find a wording that will not upset anybody to the extent
that he would throw the book away at a considerable fraction of
the speed of light...
Best wishes,
  Otto Stolz



Re: symbols for `born' and `died' + guarani sign

2003-02-24 Thread Otto Stolz
Lukas Pietsch wrote:

Actually, the symbol and several others already exist and seem to be
standardized. The Duden, the authoritative source on German
orthography, describes them in its section on typesetting practices,
under the heading of genealogical symbols. Besides the asterisk
(born), the dagger/cross (died), and the two overlapping rings (married)
it lists:
wavy horizontal line (= baptized)
a single ring (= engaged)
two rings separated by a vertical bar (= U+29DE?) (= divorced)
two rings joined by a horizontal line (= extramarital)
two swords crossed (= died in combat)
rectangle (= buried)
urn symbol (= cremated)
The married symbol, by the way, typically differs from the infinity
symbol, as it consists of two overlapping circles, not just circles
touching each other. The born and died symbol, on the other hand,
are clearly identical with the normal typographical asterisk and dagger.
The Duden also makes it clear that these are all for use in inline text
("können in entsprechenden Texten zur Raumersparnis verwendet werden",
i.e. they may be used in suitable texts to save space).
I haven't got a scanner here, else I might put up a scan somewhere.


http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
From: Duden Rechtschreibung der deutschen Sprache und der Fremdwörter
/ hrsg. von d. Dudenred. auf d. Grundlage d. amtl. Rechtschreibregeln. 
[Red.Bearb.: Werner Scholze-Stubenrecht unter Mitw. von Dieter Berger 
...]. - 19., neu bearb. u. erw. Aufl.
ISBN: 3-411-20900-3,

p. 73

Best wishes,
   Otto Stolz




Re: Character display problem in browser

2003-02-17 Thread Otto Stolz
SRIDHARAN Aravind wrote:


When the character set in browser is Central European(Windows-1250),
then small a with ogonek(\u0105) comes fine.
When the character set in browser is Central European(ISO-8859-2),
then small a with ogonek(\u0105) comes like s with caron(\u0161).


Cf.

http://czyborra.com/charsets/codepages.html#CP1250

http://czyborra.com/charsets/iso8859.html#ISO-8859-2
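
The two tables explain the symptom: the page evidently carries the byte that means a-ogonek in CP 1250, and the same byte means s-caron in ISO 8859-2. A minimal sketch with Python's built-in codecs (the helper name misread is made up for this illustration):

```python
def misread(text, actual, assumed):
    """Encode text in one charset and decode the bytes as if they were another."""
    return text.encode(actual).decode(assumed)


a_ogonek = "\u0105"

print(a_ogonek.encode("cp1250"))     # the byte actually sent by the page
print(a_ogonek.encode("iso8859_2"))  # the byte ISO 8859-2 would need instead
print(misread(a_ogonek, "cp1250", "cp1250"))     # browser set to Windows-1250
print(misread(a_ogonek, "cp1250", "iso8859_2"))  # browser set to ISO-8859-2
```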

Best wishes,
  Otto Stolz





Re: Unicode and Encoding Problems in Browsers

2003-02-07 Thread Otto Stolz
Muhammad Asif wrote:


When I tried to type, there are certain characters that are not displayed.
Instead rectangles are displayed.



This is the typical behaviour if the font in use does not contain
a particular character.

Either there is no suitable font installed on your system,
or the WWW page to be displayed requires a particular, yet
unsuitable, font.

Cf. http://www.alanwood.net/unicode/ for more info on fonts
and browsers.

Best wishes,
  Otto Stolz





Re: LATIN LETTER N WITH DIAERESIS?

2003-02-03 Thread Otto Stolz
Asmus Freytag had written:


I have updated my document at http://www.unicode.org/~asmus/what_is_this_character.pdf


...


I welcome [...] any help anyone could provide in identifying the characters
or in locating places they are used.


Lukas Pietsch wrote:

Your F725 Unknown-2, to me, looks like a German SCRIPT CAPITAL S,
(compare with U+2112;SCRIPT CAPITAL L). Yes, we were taught to write an
S like this in school. Perhaps it's used somewhere in mathematics?


 Your F7AA Unknown-8 could then be a SCRIPT CAPITAL C.

Cf. the Ausgangsschrift taught at German schools, viz.
http://www.dietschweiler.de/SUETTER/schrift.gif
(1915 through 1941), and
http://www.pelikan-lehrerinfo.de/lehrerinfo/shoppix/shopitem151big.gif
(1953 through now (but there have been more recent alternatives, viz.
shopitem150big.gif, shopitem152big.gif, shopitem155big.gif)).


I am not entirely convinced that S and C are the intended meanings.
The left-hand stroke of F725 is far too high for a capital S,
and also the position of the left-hand stroke of F7AA does not look
quite right for a C.

Based on their code positions, I think the F725 and F7AA characters
are meant as variants of d and T, respectively.

F725 resembles U+20B0 GERMAN PENNY SIGN, which is probably a script d,
derived from the Latin word denarius. (Just add an upstroke on the
left hand of the Verdana PUA character.)

This is not convincing either, I know. Just my 0,02 €.

Best wishes,
  Otto Stolz





Re: Small Latin Letter m with Macron

2003-01-17 Thread Otto Stolz
Kenneth Whistler had written:

Handwritten forms and arbitrary manuscript abbreviations
should not be encoded as characters. The text should just
be represented as m + m. Then, if you wish to *render*
such text in a font which mimics this style of handwriting
and uses such abbreviations, then you would need the font
to ligate mm sequences into a *glyph* showing an m with
an overbar.



I had replied:


This will not work, as not all 'mm' occurrences are written as
m-overbar. [example snipped]



What I wanted to convey is that these abbreviations cannot be
globally applied to a text, as they are governed by morphological
issues (possibly also by typographical considerations, or may be
just arbitrary, as Ken had suggested). So, they should be somehow
coded in the text, whenever the user wants to preserve them, akin
to the notorious Wachstube vs. Wachſtube example. A glyph-
substitution automatism is not apt for this sort of happening.

John Hudson wrote:

Ken's suggestion works fine, but only on discretely selected runs of
text. In other words, it would be up to the user *not* to apply the 
glyph substitution layout feature in the circumstances Otto describes.


[...] Obviously this is not a plain text solution: markup is required.


On the contrary, I think this is a text feature and not a mere rendering
issue. Hence, I see two possible solutions:
- mark the abbreviation with a particular character (or character sequence),
  e. g., U+006D U+0304 (abbr.) vs. U+006D U+006D (plain), or
- mark the plain (unabbreviated) occurrence of the characters,
  e. g., U+006D U+200B U+006D (plain) vs. U+006D U+006D (abbr.).

I'd prefer the former one, because it marks the deviation from the
prevalent usage.
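
To make the two alternatives concrete, here is a small Python sketch. It is only an illustration: the names are made up, the sample words are mine, and it simply treats every unmarked "mm" in a text of the second kind as abbreviated, which is exactly the assumption that scheme makes.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON
ZWSP = "\u200b"    # ZERO WIDTH SPACE

# First scheme: the abbreviation itself is marked in the text.
abbreviated_a = "i" + "m" + MACRON + "er"  # immer, written with m-overbar
plain_a = "Kammacher"                      # two m, never abbreviated

# Second scheme: the plain occurrence is marked, unmarked mm is abbreviated.
abbreviated_b = "immer"
plain_b = "Kam" + ZWSP + "macher"


def second_to_first(text):
    """Convert text marked per the second scheme to the first scheme."""
    text = text.replace("mm", "m" + MACRON)          # unmarked mm: abbreviation
    return text.replace("m" + ZWSP + "m", "mm")      # marked mm: plain


print(unicodedata.name(MACRON))
print(second_to_first(abbreviated_b) == abbreviated_a)
print(second_to_first(plain_b) == plain_a)
```

Note that m + U+0304 has no precomposed equivalent, so normalization to NFC leaves the sequence as it is.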

Best wishes,
  Otto Stolz





Re: Small Latin Letter m with Macron

2003-01-16 Thread Otto Stolz
Christoph Päper had asked:

there has been a
tradition (in handwritten text more than in print) of writing mm as only
one m with a macron above. I can't find any such character in Unicode,

You could of course build something similar with m+U+0305 to resemble the
look, but that won't become mm (just m or m¯) after a conversion to
e.g. ISO-8859-1.



This depends on the program used to do the conversion. When you want to
properly handle a particular writing tradition, you cannot rely on off-
the-shelf tools, unaware of the particular requirements.
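
As an illustration of such a purpose-built conversion, here is a minimal Python sketch. It assumes the abbreviation is coded as m followed by U+0304; the function name is made up, and the off-the-shelf behaviour shown for comparison is that of Python's own ISO-8859-1 codec with errors="replace".

```python
MACRON = "\u0304"  # COMBINING MACRON


def to_latin1_expanding(text):
    """Convert to ISO-8859-1, writing out the m-overbar abbreviation as mm."""
    expanded = text.replace("m" + MACRON, "mm")
    return expanded.encode("iso8859_1", errors="replace")


word = "i" + "m" + MACRON + "er"  # immer, written with m-overbar

# An unaware tool loses the information carried by the combining macron:
print(word.encode("iso8859_1", errors="replace"))
# A converter that knows the writing tradition restores the full spelling:
print(to_latin1_expanding(word))
```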


Kenneth Whistler wrote:

Handwritten forms and arbitrary manuscript abbreviations
should not be encoded as characters. The text should just
be represented as m + m. Then, if you wish to *render*
such text in a font which mimics this style of handwriting
and uses such abbreviations, then you would need the font
to ligate mm sequences into a *glyph* showing an m with
an overbar.



This will not work, as not all 'mm' occurrences are written as
m-overbar. E. g., G. Keller's Die drei gerechten Kammacher
http://gutenberg.spiegel.de/keller/seldwyla/kammachr/kammachr.htm
could not be written with m-overbar, as the two m characters
belong to different syllables; in modern orthography, you would
write Kammmacher, or -- if you wish so -- Kam-overbarmacher.

So, if you want to render m-overbar, you would have to mark it
in text, and the only way Unicode has to offer, is U+006D U+0304.
(I would not use U+0305, as this is too high and too wide.
I reckon a good rendering engine should adapt U+0304's width to
the pertinent base character's width.)


To do otherwise, either representing the plain text content
as m, combining-macron or with a newly encoded m-macron
character, would just distort the *content* of the text,
which is what the character encoding should be about.



It would not distort the content of the text for readers that
are accustomed to this sort of abbreviation -- no more than
the spelling "i. e." would distort the content of "that is"
for an average English reader.


If and only if an m-macron became a part of the accepted,
general orthography of German



It used to be.

Markus Scherer wrote:


I can confirm the use of m+overline from my family, [...]
I always considered those personal variations, font styles if you wish.



Now I know that the m+overline was used elsewhere,


In German handwriting (Kurrent), sequences of the letters "m",
"n", "u", and "i" look very confusing: an "ü" is written exactly
as "ii" would be (if it ever were) written; the "u" needs a hook
above (akin to U+0306, but it is an intrinsic part of the Kurrent
"u" glyph) to distinguish it from the "n", and "mm" cannot be
distinguished from "nnn" (which came into German orthography only
in 1996, so there was no ambiguity when Kurrent was widely used).
Try to read, e. g., the penultimate word of the first line of
H. Carossa's poem in http://www.e-welt.net/BfdS/schrift.htm#lese:
it's "immer" (and the penultimate word of the 2nd line reads
"Brunnens").

This problem was even worse with the medieval Textura font.

Hence, medieval scribes developed a rich set of abbreviations,
including the overbar for an omitted m or n. The latter has
survived into German handwriting, at least until the 1st half of
the 20th century.

Best wishes,
  Otto Stolz

PS. Never write Hawaii in German Kurrent ;-)





Re: Small Latin Letter m with Macron

2003-01-16 Thread Otto Stolz
Dominikus Scherkl wrote:


i. e. is a Latin abbreviation for in exemplum meaning for example
not that is.



i. e. = id est = that is
e. g. = exempli gratia = for example

Cassell's English-German Dictionary, ISBN 0-02-522920-6, also says so.

Best wishes,
  Otto Stolz








8-bit MIME (was: Documenting in Tamil Computing)

2002-12-17 Thread Otto Stolz
Dear all,

Barry Caplan had written:

SMTP [...] is not 8 bit clean. It is very
clear in the RFCs that only 7bit data is allowed over the wire.


Stephane Bortzmeyer wrote:

All these extensions are referenced in the same RFC, 2821, which is
the authoritative one about SMTP.



As of November 2002, RFC 2821 is still a Proposed Standard, and RFC 821
is the Standard Protocol (cf. http://rfc.sunsite.dk/rfc/rfc3300.html).


The most important for us is 8BITMIME:



Section 2.3.1 of RFC 2821, the proposed standard, says:
| The content is textual in nature, expressed using the US-ASCII
| repertoire [1]. Although SMTP extensions (such as 8BITMIME [20])
| may relax this restriction for the content body,

Stephane Bortzmeyer quoted section 2.4 of RFC 2821:
| Eight-bit message content transmission MAY be requested of the server
| by a client using extended SMTP facilities, notably the 8BITMIME
| extension [20].  8BITMIME SHOULD be supported by SMTP servers.

SHOULD definitely does not mean the same thing as MUST.
An SMTP server does not have to support 8-bit MIME mail.

And the remainder of the quoted paragraph requests proper MIME
headers for 8-bit text:
| However, it MUST not be construed as authorization to transmit
| unrestricted eight bit material.  8BITMIME MUST NOT be requested
| by senders for material with the high bit on that is not in MIME
| format with an appropriate content-transfer encoding; servers
| MAY reject such messages.
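
For illustration, this is roughly what a message in MIME format with a 7-bit-safe content-transfer encoding looks like when built with Python's standard email package. The subject and body text are made up; the point is only that the text is labelled with its charset and that no octet on the wire has the high bit set, so the message does not depend on 8BITMIME at all.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Test"
# Label the text as ISO 8859-1 and ask for quoted-printable transfer encoding.
msg.set_content("Café à la crème",
                charset="iso-8859-1",
                cte="quoted-printable")

wire = msg.as_bytes()  # what would be handed to the SMTP server

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(wire.decode("ascii"))
```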

Barry Caplan had written:

But for arbitrary email from one address to another, you can't rely on it.


Stephane Bortzmeyer wrote:

I have been sending Latin-1 (ISO 8859-1) emails for more than ten years (and
without using quoted-printable or other similar hacks) to
French-speaking people in various parts of the world and I'm still
waiting for an actual problem.


Mere luck, I'd say, but no proof at all.

I have seen many messages, originally in ISO-8859-1-encoded French,
that got the high bit of every accented character chopped off, thus
replacing é with i, î with n, and so forth. And even more mail
in German, distorted in a similar way. This has prompted an entry in
my E-Mail FAQ: http://www.systems.uni-konstanz.de/EMAIL/FAQ.php#SMTP-73.
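
The distortion is easy to reproduce. Here is a minimal Python sketch of what a 7-bit relay does to ISO-8859-1 text; the function name is made up for this illustration.

```python
def strip_high_bit(data):
    """Clear bit 7 of every octet, as a 7-bit-only relay would."""
    return bytes(octet & 0x7F for octet in data)


french = "é î".encode("iso8859_1")
german = "ä ü".encode("iso8859_1")

print(strip_high_bit(french).decode("ascii"))  # 0xE9 -> 0x69, 0xEE -> 0x6E
print(strip_high_bit(german).decode("ascii"))  # 0xE4 -> 0x64, 0xFC -> 0x7C
```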

Of course, more and more SMTP servers support 8-bit MIME, and many
take the pains to transform 8-bit MIME to some transfer-encoding
supported by the receiving server. If you are located behind a server
that recodes your 8-bit mail, you cannot claim that 8-bit mail is
supported everywhere; you can only claim that your server compensates
for the incompatibility of your MUA and the world at large.

Best wishes,
  Otto Stolz





Re: Mapping from HTML to Unicode

2002-12-16 Thread Otto Stolz
Hermes Glarner wrote:

I was in search of a mapping table from HTML entities (such
as &nbsp;) to Unicode characters (for the eg A0), but
had no luck up to now. Maybe you are able to help?

The official answer from the current HTML spec:
http://www.w3.org/TR/html401/sgml/entities.html.

HTML 4.01 is defined in http://www.w3.org/TR/html401/;
recommended reading for every WWW author!
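
If the table is needed in a program rather than on paper, some languages ship it ready-made. As one example (my choice of language, not part of the HTML spec), Python's standard library carries the HTML 4 entity names in html.entities.name2codepoint:

```python
import html
from html.entities import name2codepoint

# Entity name (without "&" and ";") to Unicode code point
for name in ("nbsp", "eacute", "euro"):
    print("&%s;  U+%04X" % (name, name2codepoint[name]))

# Replacing the references in running text
print(repr(html.unescape("a&nbsp;b")))
```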

Best wishes,
  Otto Stolz




