Re: Letterforms based on p

2003-06-07 Thread Lukas Pietsch

 I was hoping to find someone who had additional evidence for this
character.

I happened to come across it the other day in a modern printed edition
of 17th- to 19th-century handwritten English letters (Miller, Kerby
A., Arnold Schrier, Bruce D. Boling, and David N. Doyle. 2002. _Irish
immigrants in the land of Canaan: letters and memoirs from colonial
and revolutionary America, 1675-1815._ Oxford: Oxford UP). I haven't
got it here just now, but if it is important I might be able to
provide a few scans.

If I remember correctly, it was being used just as a handwritten
ligature of the word "per", as in "per day", "per year", etc.

Lukas





Re: book end or enclosing characters in most languages?

2003-05-30 Thread Lukas Pietsch
 
 are they the right way round? so in german it'd be:
 
 otto said So, there is not comprehensive list of
 openers vs. closers 
 possible.
 
 Does not look right here. The following is more like it:
 
 So, there is not comprehensive list of openers vs.
 closers possible.
 

No, as far as I can tell the original version is the
correct one. Look at it with a font other than Courier New,
which has a rather uncommon glyph for the German closing
quotes. Times New Roman is much more representative. 

Lukas






Re: symbols for `born' and `died' + guarani sign

2003-02-24 Thread Lukas Pietsch
 For the married symbol use the mathematical infinity symbol:
 U+221E (no pun intended).
 Indeed, one could go a step further and introduce (?) a symbol for
divorced:
 Either one of the following offers itself as a candidate:
 U+29DC INCOMPLETE INFINITY
 U+29DE INFINITY NEGATED WITH VERTICAL BAR

Actually, the "divorced" symbol and several others already exist and seem
to be standardized. The Duden, the authoritative source on German
orthography, describes them in its section on typesetting practices,
under the heading of genealogical symbols. Besides the asterisk
(= "born"), the dagger/cross (= "died"), and the two overlapping rings
(= "married") it lists:
wavy horizontal line (= "baptized")
a single ring (= "engaged")
two rings separated by a vertical bar (= U+29DE?) (= "divorced")
two rings joined by a horizontal line (= "extramarital")
two swords crossed (= "died in combat")
rectangle (= "buried")
urn symbol (= "cremated")

The "married" symbol, by the way, typically differs from the infinity
symbol, as it consists of two overlapping circles, not just circles
touching each other. The "born" and "died" symbols, on the other hand,
are clearly identical to the normal typographical asterisk and dagger.

The Duden also makes it clear that these are all for use in inline text
("können in entsprechenden Texten zur Raumersparnis verwendet werden",
i.e. they may be used in appropriate texts to save space).
I haven't got a scanner here, else I might put up a scan somewhere. I
haven't found the time yet to look up whether any more of these are already
in Unicode, but I don't remember having seen them.
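
Just to gather the list in one place, here is a little mapping sketch of my
own (not from the Duden or from the original mail); the entries left as None
are the ones whose Unicode status the paragraph above leaves open:

duden_genealogical_signs = {
    "born":           "\u002A",  # asterisk
    "died":           "\u2020",  # dagger
    "married":        "\u221E",  # only approximate: infinity, per the suggestion quoted above
    "divorced":       "\u29DE",  # infinity negated with vertical bar, per the suggestion above
    "baptized":       None,      # wavy horizontal line
    "engaged":        None,      # single ring
    "extramarital":   None,      # two rings joined by a horizontal line
    "died in combat": None,      # two crossed swords
    "buried":         None,      # rectangle
    "cremated":       None,      # urn
}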

Lukas







Re: LATIN LETTER N WITH DIAERESIS?

2003-02-02 Thread Lukas Pietsch
 All characters are now mapped to Unicode characters or character sequences
 where I felt that this was possible. If there are obvious errors, please
 point them out and I'll update the listing.

 However, there are some unidentified characters, or ones that could be
 considered missing from Unicode 4.0, or which have mappings that for one
 or the other reason could be considered not ideal. These have been
 highlighted. I welcome suggestions for additions to or subtractions from
 this list, plus any help anyone could provide in identifying the characters
 or in locating places they are used.

Your F725 "Unknown-2", to me, looks like a German SCRIPT CAPITAL S
(compare with U+2112 SCRIPT CAPITAL L). Yes, we were taught to write an
S like this in school. Perhaps it's used somewhere in mathematics?

Your F7AA "Unknown-8" could then be a SCRIPT CAPITAL C.

Your F747, "spacing left hook below" - doesn't it look very much like the
palatalization hooks used elsewhere in the list (which you mapped to
U+0321)?

Your combinations with "latin small letter dotless i" (e.g. F704, F731,
F77A) seem to be designed for use in phonetic transcriptions, and hence
are probably intended as IPA U+026A LATIN LETTER SMALL CAPITAL I.

F737: the description in your list doesn't match the glyph shown, which
includes a triangular colon.

F70F "Latin small letter a with colon" shows a triangular colon glyph
and should hence be mapped to U+02D0, not U+003A.

F70E "Latin small letter a with tilde with modifier letter triangular
colon" shows a U+0251 "Latin small letter alpha" glyph.

F750 "Latin small letter i with palatalized hook below" shows an
inverted breve glyph, not a hook.

F751 "Latin small letter i with tilde with tilde" shows a macron and a
tilde.

F754 and F755 "Latin small letter J with..." show i, not j glyphs.

F79B "Latin small letter S with retroflex hook below" shows not a
retroflex hook, but something more like an ogonek. A retroflex hook
should be attached to the left side of the S, not in the middle below,
and has its own precomposed IPA codepoint U+0282.

F7AC "Latin small letter u with dot below with diaeresis" shows an
acute, not a diaeresis.

Lukas






Re: IPA for hard g

2002-12-14 Thread Lukas Pietsch

 What is the correct IPA symbol for the g sound in gig?
 Is it U+0067 LATIN SMALL LETTER G [g], or U+0261
 LATIN SMALL LETTER SCRIPT G [ɡ]?

 It seems obvious to me that it should be U+0261, but I'm looking for a
 voice of authority to confirm this.

I'm certainly no voice of authority, but perhaps you will accept as
such the _Handbook of the International Phonetic Association_, Cambridge
UP, 1999, p.163ff.

It states that the glyph represented in the standard IPA chart
("opentail g" in IPA terminology) is to be encoded as U+0261, but that
ASCII g ("looptail g") may be used as an equivalent.

As far as I can see, almost all professionally printed material that
uses IPA symbols (such as dictionaries etc.) uses the opentail g glyph,
i.e. U+0261.

Lukas






Re: Localized names of character ranges

2002-12-02 Thread Lukas Pietsch

 Just a question: has anyone who is concerned about these considered
 sending the suggestions to someone at Microsoft, where they might do
some
 good? It's nice to tell people on the Unicode list, but to have any
impact,
 Microsoft needs to be involved.


True enough. Sorry if I used up bandwidth for people not concerned with
this. I was hoping that someone with the right connections would be
around here. It wouldn't exactly be easy for a simple Joe User like me to
find the right address at Microsoft and get listened to, would it?

Lukas






Localized names of character ranges

2002-12-01 Thread Lukas Pietsch
Hello,

I just wondered if anybody at Microsoft has noticed that the names of
the Unicode ranges used in German localized editions of MS Office are
woefully inadequately translated. It's been a long-standing cause of
irritation when working with Word97, and if I remember correctly it
hasn't been corrected so far, at least not in Word2000. I'm referring to
the names as they are used in the Insert-Symbol dialog.

Some of these mistranslations are really far off. To the average user,
they will just make no sense at all, but for people on this list they
may actually be quite funny. So, just for your enjoyment, here goes:

"Spacing modifier letters" has been translated as if it meant letters
that modify the spacing ("Buchstaben zur Abstanddefinition"). The
average user would probably expect to find things like em-space and
en-space in that range? Or has somebody succeeded in getting control
characters added to Unicode that encode some kind of kerning
information? W.O., perhaps? ;-)

In a similar vein, "Alphabetic presentation forms" have come out as
"characters for alphabetic display" ("Zeichen zur alphabetischen
Darstellung"). Same goes for the "Arabic presentation forms".

Less severely, "combining diacritical marks" have been mistaken for
"combined diacritical marks" ("kombinierte diakritische Kennzeichen").
What would you expect in such a range, things like Greek Dialytika
Tonos, or even precomposed letter combinations?

The same confusion about combining/combined goes for the combining
characters in the U+20Dx "Combining Marks for Symbols" range
("kombinierte diakritische Sonderzeichen"). Also, the "for Symbols" part
has not been rendered at all, and the difference between "Sonderzeichen"
(special characters) and "Kennzeichen" (marks) will probably not mean
anything to the average user.

Finally, "Georgian" has been translated as "Georgianisch" (perhaps by
analogy with "Gregorianisch"?) instead of the correct "Georgisch".


Is there anybody here who could bring this to the attention of the
localization people at MS, if appropriate? I'd really hate having to use
"Buchstaben zur Abstanddefinition" for the next 20 years...

Just to be constructive, here are my suggestions for a better translation:

"Spacing modifier letters" = "Nichtkombinierende Diakritika" (I know
it's not very precise, but I couldn't come up with anything better)
"Combining diacritical marks" = "Kombinierende Diakritika"
"Combining marks for symbols" = "Kombinierende Symbolzusätze"
"Alphabetic presentation forms" = "Alphabetische Präsentationsformen"
"Arabic presentation forms" = "Arabische Präsentationsformen"
"Georgian" = "Georgisch"

Lukas







Re: Special characters

2002-11-05 Thread Lukas Pietsch
 Could someone tell me whether it is possible to produce
 the following characters please?

Sure:

k with a small line underneath
&#x1E35; &#7733; LATIN SMALL LETTER K WITH LINE BELOW
or: k&#x0331; k&#817;

K with a small line underneath
&#x1E34; &#7732; LATIN CAPITAL LETTER K WITH LINE BELOW
or: K&#x0331; K&#817;

H with a dot underneath
&#x1E24; &#7716; LATIN CAPITAL LETTER H WITH DOT BELOW
or: H&#x0323; H&#803;

h with a dot underneath
&#x1E25; &#7717; LATIN SMALL LETTER H WITH DOT BELOW
or: h&#x0323; h&#803;

B with a small line underneath
&#x1E06; &#7686; LATIN CAPITAL LETTER B WITH LINE BELOW
or: B&#x0331; B&#817;

b with a small line underneath
&#x1E07; &#7687; LATIN SMALL LETTER B WITH LINE BELOW
or: b&#x0331; b&#817;

D with a small line underneath
&#x1E0E; &#7694; LATIN CAPITAL LETTER D WITH LINE BELOW
or: D&#x0331; D&#817;

d with a small line underneath
&#x1E0F; &#7695; LATIN SMALL LETTER D WITH LINE BELOW
or: d&#x0331; d&#817;

G with a line on top
&#x1E20; &#7712; LATIN CAPITAL LETTER G WITH MACRON
or: G&#x0304; G&#772;

g with a line on top
&#x1E21; &#7713; LATIN SMALL LETTER G WITH MACRON
or: g&#x0304; g&#772;

E with an upside down ^ on top
&#x011A; &#282; LATIN CAPITAL LETTER E WITH CARON*

e with an upside down ^ on top
&#x011B; &#283; LATIN SMALL LETTER E WITH CARON*

Mirror image of a comma, but not at the bottom - should be higher, like
an '
&#x02BD; &#701; MODIFIER LETTER REVERSED COMMA (use as a letter);
or: &#x201B; &#8219; SINGLE HIGH-REVERSED-9 QUOTATION MARK* (use as a
punctuation mark);

The codes marked &#x...; are the hexadecimal Unicode values, those marked
&#...; are the decimal ones. You can use them in this form in html pages.
You will need a 'large' Unicode font for most of these - only the ones
marked with an * are found, for instance, in standard Windows Unicode
fonts such as Times New Roman. I suggest Gentium
(http://www.sil.org/~gaultney/gentium). The combining characters U+0304,
U+0331 and U+0323 are also in Lucida Sans Unicode and some other fonts.
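
(A quick sketch of my own, not part of the original message: the hexadecimal
and decimal references above can be generated mechanically, e.g. in Python.)

for ch in "\u1E35\u1E24\u02BD":
    # print the hexadecimal reference, the decimal reference, and the character
    print("&#x%04X;  &#%d;  %s" % (ord(ch), ord(ch), ch))
# first line of output: &#x1E35;  &#7733;  followed by the k-with-line-below character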

Hope this helps,

Lukas







Re: Common input methods for IPA

2002-07-10 Thread Lukas Pietsch

Hi Marc Wilhelm,

 In this brave new world of wonderful input methods, what is
 the current state of affairs for keyboard-based input methods
 for characters from the IPA block? Is there any de facto standard
 for this and, for that matter, for an IPA keyboard layout?

I'm not aware of a standard either (maybe the Keyman keyboards
distributed by SIL for their SIL phonetic fonts come closest to one, in
terms of widest distribution), but I guess any *international* standard
would be highly problematic anyway - any key mnemonics are bound to fail
for users who are accustomed to one national keyboard and not another.
Personally, I've come to use my own home-grown Keyman method for Unicode
IPA. Maybe I am and shall remain the only person in this world who finds
the layout intuitive, but it works for me. Since it's based on the layout
of a German keyboard, you might find it worth having a look at:

http://people.freenet.de/LukasPietsch/Keyman/Keyboards.html

Hope this helps,

Lukas







Re: Recent Threats

2002-02-27 Thread Lukas Pietsch



 Would you by chance mean 'threads' ?

 There is a difference, you know ;-)

Quite right. And, in order to prove Stefan's point: how about starting a
new thread/threat now about why we Germans are so prone to confuse these
letters, and what consequences that ought to have for a possible
unification of these two characters in Unicode? Any takers?
;-)

Lukas






Re: Smiles, faces, etc

2002-02-16 Thread Lukas Pietsch

Falkor wrote:

 I was thinking more that this would allow modern software to translate a
 lower-ASCII three-character sequence into a single unicode emoticon
 character that would be displayed properly regardless of OS and software,
 also alleviating the need for such developers to create proprietary
 artwork for each.  This multiple-keystroke-per-character input method
 does have precedent with Asian languages.

I'm starting to wonder about this thread. Really, why would anybody want
to have the Ascii-smilies replaced by single standardized faces
created by some font designer? The creative process of composing these
smilies from their Ascii components, together with the
open-endedness of the repertoire and the scope for creative variation
this involves - isn't that just the fun of the whole thing? The
playfulness? Isn't that exactly what has made them so popular?

Lukas







Re: A few questions about decomposition, equvalence and rendering

2002-02-05 Thread Lukas Pietsch

John Cowan wrote:

 Eh?  U+1FC1 *is* nonspacing.  The U+1Fxx ones are the spacing
 compatibility equivalents, except for this one.


U+1FC1 is spacing in all the fonts that I've seen. And it decomposes to
U+00A8 U+0342 (canonically), i.e. to a sequence of spacing plus
non-spacing character. At least it did so in Unicode 3.0.
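
(A little check of my own, not in the original mail, using Python's
unicodedata module:)

import unicodedata
print(unicodedata.decomposition("\u1FC1"))   # '00A8 0342' - spacing dialytika + non-spacing perispomeni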

Not that I would bother much - I have no idea where that character
should ever be used.

Lukas Pietsch






Re: Unicode 3.1 and Roman numeral harmonic analysis

2001-07-18 Thread Lukas Pietsch

 Are the letters used in Roman numeral harmonic analysis  Roman 
 numerals or are some other letters also used ?

There are quite a number of different systems out there, but it's common to them all 
that they use some combination of Roman letters with numbers (often subscript or 
superscript), musical accidentals (flat / sharp signs); plus / minus / greater-than or 
smaller-than signs, and other graphical symbols such as strokes, brackets, circumflex 
accents...

Many systems (including the Schenkerian analysis that is fashionable in the
Anglo-Saxon world) use capital Roman numerals as their base symbols, while other
systems (such as the Riemannian analysis that is common in Germany) use letter
combinations that stand for the harmonic functions: T, D, S, Tp, and so forth,
including the double-dominant and double-subdominant symbols (a partial overlay of
two Ds or two Ss, respectively).

Do you need a few scans? 

Lukas





Re: Unicode transliterations (and other operations)

2001-07-04 Thread Lukas Pietsch

James Kass wrote:

 Indeed!  Or, at least if we need a correct definition of
 an English word, we should consult an English dictionary.
 The web page cited by Mr. Constable is simply misleading, unless
 it were to be amended to clearly state "for the purposes of
 this and related documents... these words mean &c."

Well, the English dictionaries give usages of words in everyday language,
and that's fine. But in their usage as technical terms, the distinction between
"transcription" and "transliteration" (roughly along the lines of the
http://www.elot.gr/tc46sc2/purpose.html page) seems to me to be a fairly
well-established one, in the field of linguistics at least.

 No international body has any authority to alter the meaning of
 existing words in my language or any of our languages.

Sure, but we're dealing with a scholarly discipline's technical vocabulary here, and 
it's not such a bad idea in this case if computer people dealing with language adopt 
the usage of linguists, is it?

 what they call transliteration could easily be
 referred to as reversible transliteration in plain English,
 without 'breaking existing applications' like my dictionary.

You must understand: this isn't about breaking existing applications, it's about a 
higher-level protocol! ;-)


Lukas Pietsch





Re: translation help desired: symbols

2001-05-02 Thread Lukas Pietsch

Greek = ?
symbolo (symbolo)

Yes, but don't omit the accent:
σύμβολο, plural σύμβολα

(oh yes, and this *is* another UTF-8 message again, I couldn't help it.)

Lukas





RichEdit v.4 common control in Win98?

2001-04-07 Thread Lukas Pietsch

Hello,

from a recent posting by Peter Constable I take it that it is possible to
have Unicode keyboard input with Tavultesoft KeyMan 5.0 (using WM_UNICHAR
messages) in some applications under Win98, provided you have version 4 of
richedit20.dll installed, and that a new version of Wordpad supports this.
I have richedit20 v.3 on my system, which apparently came with IE5.5, and
it doesn't provide this functionality.

Questions:
(a) Is the new richedit control, and/or the new Wordpad version, available
for download somewhere?
(b) If you install the new version of richedit20.dll, does that actually
add the WM_UNICHAR functionality automatically to applications that
previously were using richedit20 v.3? (e.g. Outlook Express...?)
(c) Has anybody got a list of existing Win98 applications that can make use
of the WM_UNICHAR functionality?

Thanks,

Lukas


-
Lukas Pietsch
University of Freiburg
English Department

Phone (p.) (+49) (761) 696 37 23
mailto:[EMAIL PROTECTED]





Re: Classical Greek on a Mac

2001-04-04 Thread Lukas Pietsch

David Perry wrote:

 I have a polytonic keyboard for
 Windows that I have created using Keyman, which is not available for the
 Mac.  I'd be happy to share the documentation for this

Would you be willing to share the Keyman keyboard itself? I just downloaded
Keyman 5.0 after some people on this list told us what wonderful things it
can do. But I have no keyboards as yet to go with it.
Thanks ever so much,

Lukas





Re: Square and lozenge notes -- Musical Notation 3.1 -- Mensural notation

2001-03-07 Thread Lukas Pietsch

 All notes could have been given post-1420 names given the fact that the
 white notes appear only after 1420...

Well, not really, because there are quite a few symbols (black notes of
semibreve and above) which occur only in the pre-1420 notation. So the
series of "black" note names would have a confusing gap:

"black head with no stem" = "black semibrevis"
   = no "black minima" 
"black head with stem"  = "black semiminima" (new usage)
"black head with stem and flag1" = "black fusa" (new usage)
"black head with stem and flag2" = "black semifusa" (new usage)

"white head with no stem" = "white semibrevis"
"white head with stem" = "white minima"
"white head with stem and flag1" = "white semiminima"
"white head with stem and flag2" = "white fusa"
etc.

That's what your proposal boils down to, isn't it? Well, certainly
historically correct, but I find it even slightly more confusing than the
other way. I do think that the terminology Unicode has chosen is the more
consistent one. Confusing, yes, but it *will* be confusing to
non-specialist users either way, won't it?


 P.S. Incidentally, do your sources also show consistently the nominal
form
 of the MAXIMA and LONGA with stems pointing downwards contrarily to the
 Unicode reference glyph ?


Oops, indeed, they do, and I hadn't noticed. (As I said, my musicology days
at university are way back...) -- This might very well be significant. Yes,
I think mensural notation did not have the modern convention that the
direction of the stems depends on the position of the notehead on the stave.
Hold on, I'll check.

I also notice that the "black maxima" seems to be missing. Since we have
the "black" and "white" series, we ought to have them both complete, right?
"black longa" can be thought of as unified with Gregorian 1d1d3 "virga", and
"black brevis" with generic 1d147 "square notehead black", but the "black
maxima" isn't there.

Lukas





Re: Square and lozenge notes -- Musical Notation 3.1 -- Mensural notation

2001-03-07 Thread Lukas Pietsch

In my last posting I wrote:
 I also notice that the "black maxima" seems to be missing. Since we
 have the "black" and "white" series, we ought to have them both
 complete, right? "black longa" can be thought of as unified with
 Gregorian 1d1d3 "virga", and "black brevis" with generic
 1d147 "square notehead black", but the "black maxima" isn't there.

Patrick Andries has answered this point, suggesting that
the black and white variants should be seen as font variants.
I guess that's a valid point, but it raises the question why the other
musical notes aren't unified in the same way. There are separate characters
(1d1b9) "SEMIBREVIS WHITE" and (1d1ba)  "SEMIBREVIS BLACK". Note that these
symbols are *not* affected by the semantic ambiguity problem we were
discussing, which involves only the smaller note values minima, semiminima,
fusa and semifusa.
I'd be interested to learn the rationale behind these choices. Is the
original proposal available anywhere?

As for the other question, that of the stem of "longa" and "maxima": Yes,
Patrick's suggestion is right that the most common form of these notes has
a downward stem (on the *right* side of the notehead, mind!). In earlier
mensural notation, the direction of the stems did not depend on the
position of the notehead on the stave, as it does today; rather, minims and
other small notes always had upward stems, and single longae and maximae
mostly had downward stems. However, the odd example of longae with upward
stems can be found even then. From the mid-16th century onwards the modern
convention of context-dependent stems seems to have emerged, and from then
on both the longa and the minim stems were placed according to it. So it
seems consistent that the Unicode charts show all notes with upward stems,
implying that upward and downward stems are context-dependent glyph
variants.

Plenty of examples of all this can be found in: Willi Apel, Die Notation
der polyphonen Musik 900-1600. Leipzig 1962/1970.


Lukas







Re: Latin digraph characters (was: Re: Klingon silliness)

2001-02-28 Thread Lukas Pietsch

Doug Ewell wrote:

 Aren't Serbian and Croatian the standard example of two "languages" that
are
 really the same language but are treated separately

This question about languages being "really" the same or not turns out to be
a rather moot one from a linguist's viewpoint, even more so once the issue
gets burdened with national feelings. I mean, are English and Scots the
same? Are Bulgarian and Macedonian the same? Are Rumanian and Aromunian the
same? Are Ancient Greek and Ancient Macedonian the same? Are Upper German
and Lower German the same? Are German, Schwitzerdytsch and Letzeburgsch the
same? Are Dutch and Flemish the same? Are British and American English the
same? (That was an issue at one time!) -- There are probably as many such
issues as there are nations in the world, or more, and as a linguist you
get weary of being asked what the "real" answer is in each case.

 Are there any linguistic or vocabulary differences between them?

Well, there are bound to be, at some level, and if not in the normative
standards, then in the actual spoken varieties of relevant population
centers. The question is just, how big are these, and--a different and much
more important question--how big do people *want* to *perceive* these
differences to be?

Lukas

(P.S.: Sorry Doug, I meant to send this to the list in the first place.)




Re: Help with Greek special casing

2001-02-26 Thread Lukas Pietsch

Carl Brown asked:
 Is it final when followed by a hyphen or combining diacritical mark?

Patrick Rourke answered:

 Don't know what the Unicode rules are, but the answer is no.  The final
 sigma form is not used if the sigma is in a medial position in the word
but
 at the end of the line (e.g., when it occurs at the point of hyphenation
in
 a hyphenated word at line end).

Just one addition: You do get a final sigma before explicit (hard) hyphens,
i.e. u+2010 and other kinds of dashes, as opposed to the (soft) line-breaking
hyphen (u+00AD).
I guess explicit hyphenation isn't likely to occur in typesetting of
Ancient Greek, but it does occur in Modern Greek, in noun compounds of the
type κράτος-μέλος.
The Unicode rules will handle this correctly, as far as I can see.
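
(A small illustration of my own, not in the original message: Python 3's
str.lower() implements the Unicode Final_Sigma rule; the hyphen-minus below
stands in for an explicit hyphen.)

print("ΚΡΑΤΟΣ-ΜΕΛΟΣ".lower())        # κρατος-μελος : final sigma both before the hyphen and at the end
print("ΚΡΑΤΟΣ\u00ADΜΕΛΟΣ".lower())   # medial sigma before the (invisible) soft hyphen U+00AD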

Lukas





Re: Inverted breve in Greek?

2001-02-22 Thread Lukas Pietsch

Seán,

these are "perispomeni"s. Not uncommon to see them printed like that.
Encode as u+0342.

Best wishes,

Lukas







Re: What about musical notation?

2001-02-22 Thread Lukas Pietsch



 Am I right in thinking that in the days when hand set metal type on
printing
 presses was the only method of printing that there were fonts of musical
 type?  I have never seen any font of such type myself, though I have seen
 fonts for such non-text matters as chess sets and crossword puzzles.


As far as I know, music printing with movable type of this kind was
indeed done, mostly back in the 16th/17th century. There were "letters",
each of which represented one fragment of a stave with one or several
noteheads on it. It tended to look pretty rough, though. Almost as if we
were to put staves together from ASCII characters like:

---o---
---|
---|
---|



High-quality printing since the mid-18th century has been done by engraving
or etching in metal plates, where the graphics are either first drawn by
hand on the metal surface, or applied to it with stamps of some sort.

Lukas




Re: extracting words

2001-01-29 Thread Lukas Pietsch


Christopher Fynn wrote:

BTW without determining the language as well as the script, how do you
propose to determine if a particular string actually matches a word in
your "blacklist" (in terms of meaning) or not? The same string of
characters might mean completely different things in two languages that
share the same script (/Unicode block).

This is assuming that what we want is not just a matching of
*orthographical* words (character strings), but of *lexicographical* words
(aka lexemes). Which of course brings with it even more problems. If you
want to filter out all occurrences of, say, a particular verb, you'll have
to look out for all possible grammatical forms of that verb: five forms at
most in English (go, goes, went, gone, going), but maybe several
hundred in a heavily inflectional or agglutinative language. In some
languages the set of possible forms of a lexeme may even be open-ended. There
is no way of doing that without a full-blown morphological parser (which of
course would have to be language-specific). Looks like this goes a bit
beyond what Brahim is planning to do.
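
(To make that concrete, a tiny sketch of my own with a hypothetical
blacklist; it is not part of the original message:)

blacklist = {"go"}                     # hypothetical blacklisted lexeme, in one orthographical form
text = "she went home and is going back"
hits = [w for w in text.split() if w in blacklist]
print(hits)                            # [] - "went" and "going" slip through unmatched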

Lukas





Re: Benefits of Unicode

2001-01-28 Thread Lukas Pietsch


 Francois M Richard wrote:
 
  Can Unicode conformance be applied to rtf (and how)?
 
Newer Microsoft products (from Office 97 onwards?) seem to use constructs
of the form \uNNNN\'YY to encode Unicode characters, where NNNN is the *decimal*
Unicode value and YY is a replacement character in ANSI as an alternative
for non-Unicode-aware readers. The rtf source text itself is encoded in
7-bit ASCII, and the codepage used to interpret the \'YY commands is
specified somewhere in a command in the header.
This is the method apparently used by many Windows applications
internally to exchange Unicode data, e.g. through the clipboard. Just save
a sample Word document with some Unicode characters to rtf to see how it
works.
There are more details on this somewhere
in the MSDN library, under "Specifications/Applications/Rich Text Format".


As for html, you can either embed Unicode character entities of the form
&#NNNN; in an otherwise 8-bit source text, or have the whole source text in
UTF-8. (This is probably rather over-simplified, I guess... :-)
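
(A rough sketch of my own, not from the original mail, of the two escaping
schemes just described; a real RTF writer would emit the nearest ANSI byte
instead of the generic '?' fallback used here:)

def rtf_escape(s):
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)
        else:
            n = ord(ch)
            if n > 32767:                  # RTF \u takes a signed 16-bit decimal value
                n -= 65536
            out.append("\\u%d\\'3f" % n)   # \'3f = '?' as the ANSI replacement
    return "".join(out)

def html_escape(s):
    return "".join(ch if ord(ch) < 128 else "&#%d;" % ord(ch) for ch in s)

print(rtf_escape("Gruß"))    # Gru\u223\'3f
print(html_escape("Gruß"))   # Gru&#223;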

Hope this helps,

Lukas








Re: Transcriptions of Unicode

2001-01-12 Thread Lukas Pietsch

Marco Cimarosti wrote:

 I don't fully agree with Mark Davis' IPA transcription of "Unicode":


http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_IPA.gif

Neither do I, but partly for different reasons.


 1) I think that IPA transcriptions should be in [square brackets], while
 phonemic transcriptions should be in /slashes/. If neither enclosing is
 present, the transcription is ambiguous.

Right. And that's actually part of the key to the problem's answer:

 2) AFAIK, the phoneme [o:] (a long version of "o" in "got") does not
exist
 in any standard pronunciation of contemporary English. It should rather
be
 the diphthong [ou] (where the [u] would probably better be U+028A).

In America, transcribing the vowel in "code" as /o/ (and "made" as /e/) is
not uncommon, at least in *phonemic* transcription. Generally, American
accents have less diphthongization in these sounds than British accents
have, and phonemically it makes sense to see these sounds as part of the
series of "long vowels". A *narrow phonetic* transcription would have
something like [u+006F u+028A] for American, and [u+0259 u+028A] for
British.

 3) The transcription shows the primary stress on the first syllable, and
a
 secondary stress on the last one. In the few occasions when I heard
native
 English speakers saying "Unicode", I had the impression that it rather
was
 the other way round.

I can't tell, because where I live I don't get to talk to native speakers
about Unicode a lot. But: According to standard word-formation and
pronunciation patterns in English, the stress pattern shown ('uni,code) is
absolutely what you'd expect: as in "uniform", "unisex", "unicorn",
"universe". (D. Jones, English Pronouncing Dictionary, doesn't even mark a
secondary stress on the third syllable at all.)

 4) As "Unicode" is the proper name of an international standard, and it
is
 built with two English roots of French origin, it could as well be
 considered a French word, which would lead to a totally different
 transcription.

Right, but this particular pattern of merging word roots into a new word
does suggest English provenance, I think. And, historically, that's where
it did come from.

But there's another inconsistency in the transcription: the vowels in the
first ("u-") and third ("-code") syllable are both phonemically long.
Either you put the length mark on both (recommended for *phonetic*
transcription), or on neither (okay with *phonemic* transcription). (Of
course, if you transcribe the third syllable as a diphthong then you won't
get a length mark there.)

According to the conventions in D. Jones, English Pronouncing Dictionary,
you'd get something like:

[u+02C8 u+006A u+0075 u+02D0 u+006E u+026A  u+006B u+0259 u+028A u+0064]
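
(Just to render that sequence -- a quick sketch of mine, not part of the
original message:)

ipa = "".join(chr(c) for c in
              [0x02C8, 0x006A, 0x0075, 0x02D0, 0x006E, 0x026A,
               0x006B, 0x0259, 0x028A, 0x0064])
print(ipa)    # ˈjuːnɪkəʊd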

Lukas


-
Lukas Pietsch
University of Freiburg
English Department

Phone (p.) (+49) (761) 696 37 23
mailto:[EMAIL PROTECTED]




Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.

2000-12-07 Thread Lukas Pietsch

Michael Kaplan wrote:


  plus...
  dumb question 1.  Is Aramaic (which doesn't seem to have a 2 character
ISO
  code) the same as Amharic (which does...AM)?   If not, Amharic appears
to
 be
  a Semetic language too, is that written right-to-left too?

 Amharic uses the Ethiopic script, and is not RTL as far a I know. Aramaic
 has no native speakers

As far as I know, there is still a (small) minority of speakers in Turkey
and Syria who speak the present-day descendant language of (biblical)
Aramaic. This present-day dialect is commonly called Aramaic too. I have
absolutely no idea what writing system, if any, they would use today
(although probably not the ancient Aramaic script? More likely Arabic?)

Lukas Pietsch




Greek Diacritics Again

2000-11-23 Thread Lukas Pietsch

Dear all,

there's another issue about Greek diacritics I'd like to ask the opinion of
the people who are in the know: the question of (monotonic) Greek "TONOS"
and (polytonic) Greek "OXIA" and their equivalence. I know this has had a
somewhat troublesome history in Unicode.

I seem to remember I read in some Unicode document that the Greek "TONOS"
could be realized *either* as an acute *or* as a vertical stroke. I can't
locate the reference at the moment. Unfortunately I haven't got the book at
hand here and I've been searching the website in vain. Is the standard
(still) actually saying this, or is my memory failing me?

On the other hand, the standard is of course quite unambiguous now about the
fact that the two accents are equivalent in principle. All the "Oxia"
codepoints in 1fxx are singletons (therefore deprecated?) and canonically
map to the corresponding "tonos" codepoints in 03xx.
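
(A small check of my own, not in the original mail, with Python's
unicodedata:)

import unicodedata
# the singleton decomposition maps "alpha with oxia" to "alpha with tonos",
# and NFC therefore folds the former into the latter:
assert unicodedata.decomposition("\u1F71") == "03AC"
assert unicodedata.normalize("NFC", "\u1F71") == "\u03AC"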

Would it be fair to sum up the consequences of all this for font design in
the following way: If a font is designed for use with both monotonic and
polytonic Greek, then the "tonos" glyphs should *definitely* look like
acutes. If a font is designed for monotonic Greek only, a font designer can
choose to use either acutes or verticals (or any other shape, for that
matter: decorative typefaces in Greece are apparently using all sorts of
things from wedges to dots or squares...)
But can you think of any good reason for a font to have different (default)
glyphs for the "tonos" and for the "oxia" characters side by side?



Lukas Pietsch
Ferdinand-Kopf-Str. 11
D-79117 Freiburg
Tel. 0761-696 37 23

Universität Freiburg
Englisches Seminar




Open-Type Support (was: Greek Prosgegrammeni)

2000-11-22 Thread Lukas Pietsch

Dear all,

a lot was said in this thread about intelligent rendering mechanisms, such
as fonts implementing automatic glyph substitution and things like that. The
notion appears to be quite commonplace to the experts, whereas I (being an
amateur) must admit it seemed just like a utopian dream to me when I first
heard of the possibility of such a thing, a few months ago. I figure that
people are mostly thinking of the technology called "Open Type", is that
right?

Can anybody enlighten me about how much support for that technology is
already available in standard software, say, in browsers or text processors
under Windows 9x? If I had a True-Type font that implemented the glyph
substitutions, say, for the Greek combining diacritics, could I make my
average standard word processing software actually use these features? Or
would I have to wait for specialized multilingual word processors to appear
on the market?

I found the documentation of the "GetCharacterPlacement" function in the
Windows API. It looks like that was the place where these things should be
implemented system-wide. But I played with it a bit and found it didn't
actually do any glyph replacements. Is that function actually implemented in
Win98, or is it just a stub? Or did I make a mistake in my testing, or is
something wrong with my system? Can Win2000 do more than Win98 in this
respect?

I also noticed that MS Internet Explorer does use glyph replacement features
on my system when it is displaying Arabic. How does it do that? Would there
be a way of making it use other Open-Type features too?




Lukas Pietsch
Ferdinand-Kopf-Str. 11
D-79117 Freiburg
Tel. 0761-696 37 23

Universität Freiburg
Englisches Seminar




Re: Greek Prosgegrammeni

2000-11-22 Thread Lukas Pietsch

Thanks to Asmus and Kenneth for their clarifying comments. Things are
beginning to seem to make sense to me... (:-)

Especially, I'm quite relieved to see now that:
- for any one of the common printing variants of mute iota that a user might
want to see,
- there is already at least one easily available truetype font, so that
- even *without* special glyph shaping or glyph substitution mechanisms in
display,
- there will be at least one way of encoding that will be stable, in the
sense that it will guarantee the desired display and not get corrupted when
undergoing canonical composition/decomposition;
and, most importantly:
- all these encodings will be recognized as equivalent by Unicode
applications when it comes to case-insensitive matching (because all these
character sequences case-fold to the same sequence of vowel + small iota
(03B9); see the little check below). That's something, isn't it?
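
(The little check, of my own making and not part of the original message;
Python's str.casefold() applies Unicode full case folding:)

assert "\u1FBC".casefold() == "\u1FB3".casefold() == "\u03B1\u03B9"
# capital alpha with prosgegrammeni and small alpha with ypogegrammeni
# both fold to alpha + iota, so case-insensitive matching treats them alike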

What will *not* work, for most users, is automatic case *conversion*. This
will lead to undesired or unexpected results in most cases. But there are
other independent reasons for that anyway: For most users, correct
uppercasing also involves the stripping of accents and breathings, and the
Unicode casing rules don't provide for that either. But then again: who
wants to use automatic case conversion for polytonic Greek anyway? (I can
hardly remember having ever used it even in the Latin script in all the text
processing I've done.) People will simply be typing sequences that Unicode
will see as irregular mixed-case strings, but who cares? I guess all the
computational features that really matter to most of us common mortals (like
sorting, word searches etc.) involve the "case-folding" feature used for
case-insensitive matching, and as I said above, this seems to work out in a
fairly intuitive and sensible way.

So, after all, the UTC people do deserve a pat on the back for their good
work? (:-)

I have another ignorant layman's sort of question, but I'll put it into a
second message because it really constitutes a different topic.

Lukas





Re: Open-Type Support (was: Greek Prosgegrammeni)

2000-11-22 Thread Lukas Pietsch

John Hudson wrote:

 At present, polytonic Greek is not supported in Uniscribe,
 I suspect because no one has determined that it needs to be.

So, would you agree that it does need to be? Keeping in mind what Kenneth
Whistler wrote:

 Not if the fonts they use map capital letter + ypogegrammeni character
 combinations into capital letter + full-size iota glyph sequences.

 Of course, if the fonts they use are not designed for correct use with
 polytonic Greek, then the default rendering behavior of the ypogegrammeni
 will not be what they expect or want. Time to upgrade the fonts.
...
 This is not all that sophisticated. It should be a matter that can be
 wholly encapsulated within the fonts:

 Font I                                  Font II

 A. 0397 0313 0345  ==  'H iota adscript    'H iota subscript

 B. 1F98            ==  'H iota adscript    'H iota subscript
 ...
 Many of us have felt all along that polytonic Greek should always be
 represented decomposed, and that the ELOT polytonic "character" encoding
 was a dangerous conflation of glyph design and character encoding
concerns.
...

 Implementations that use full decomposition for polytonic Greek and fonts
 that correctly map the accentual and diacritic combinations are the
 best bet for consistency *and* good presentation in the long run.


Mind that the case-mapping question we were discussing is just one minor
aspect of the issue; the main task is much more general, and at the same
time more straightforward (if we leave aside the issue of automatic case
conversion and the fancy problems of, let's say, small-caps): the decomposed
character sequences simply need to be mapped to the precomposed ones. It
affects not only the iota subscripts/adscripts but also all the other
diacritics. Without some glyph processing most combinations will never
display readably. Since the precomposed glyphs already exist as Unicode
codepoints, I suppose that the implementation would probably not even be
very difficult, and not much of it would even depend on the individual font,
would it?
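
(For instance -- a sketch of my own, not from the mail -- the mapping from
decomposed to precomposed polytonic Greek is exactly what canonical
composition, NFC, performs:)

import unicodedata
decomposed = "\u03B1\u0313\u0301\u0345"   # alpha + psili + oxia + ypogegrammeni
assert unicodedata.normalize("NFC", decomposed) == "\u1F84"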

By the way, I wouldn't agree with Kenneth that it wasn't a good idea to have
the precomposed characters in Unicode in the first place. I'm very glad they
are there, since, as we see, the beautiful smart rendering features we are
talking about are simply not yet available in mainstream text processing
software. Much as I like the idea of the projects such as "Graphite" that
Marco mentioned, I do think there are quite a number of people out here who
would love to be able to handle Greek comfortably in their everyday
all-purpose text-processing and browsing software. The precomposed
characters are at present the only means they have to do so on a Windows
platform. Adding smart rendering support for the decomposed characters would
provide them with a much better means; I'd certainly agree with Kenneth
about that. And I'd also think it would be preferable if that could be done
system-wide and not just by some individual application, wouldn't it? So
Uniscribe seems like the best bet at the moment for Windows users.

What do the Microsoft people think? May we hope?



Lukas




Re: Greek Prosgegrammeni

2000-11-19 Thread Lukas Pietsch

Sorry I'm going on about this again, but I feel still puzzled, so bear with
me once more.

I'm not quite sure if Mark's answer solves my problem. I can see that the
case mappings and decompositions as defined in the charts are internally
contradiction-free, no problem so far. Only, there still seems to be a
mismatch between what the charts show and what users will probably expect to
see. Let me repeat: as far as I can gather, there are several different
typographical traditions, but roughly speaking there are two: In one
tradition, readers expect to see full-size, spacing glyphs for mute iotas
*both* in titlecase and in uppercase (usually a small iota glyph in
titlecase, a small or capital iota glyph in uppercase). In the other
tradition, readers expect to see smaller, diacritic-like glyphs (either
centered under, or near the right corner of, the base letter), again *both*
in titlecase and in uppercase. All the printing I've seen so far seems to
adhere either to the one major pattern or the other; they apparently don't
often get mixed. And as we've seen, many people who are used to the one
pattern aren't even aware that the other exists.

The Unicode charts, somehow arbitrarily, seem to dictate in favour of the
one tradition in the one case and of the second tradition in the other. In
titlecase you get some sort of a non-spacing diacritic, while in uppercase
you *must* use the full-size capital iota glyph. Users who want full-size
iota glyphs throughout will find it difficult to live with the decomposition
to u+0345 in titlecase, while users who want small diacritic glyphs
throughout will see no sense in the u+0399 (capital iota) in uppercase.
Without some *very* sophisticated rendering machine, neither group will be
able to get it all displayed to their taste. People will prefer to encode
their texts in ways that deviate from the norm, sacrificing case
equivalence rather than what each of them considers "correct" display.


Lukas

- Original Message -
From: "Mark Davis" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Sunday, November 19, 2000 8:18 PM
Subject: Re: Greek Prosgegrammeni


 I haven't had time to read this list recently, so here is a somewhat
belated
 response.

 But, even if you do so, we are left with a "wrong" canonical
decomposition:

 1FBC;GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI;Lt;0;L;0391 0345;;;;N;;;;1FB3;

 According to James' statement (which is not totally supported by others,
 anyway), the decomposition should be U+0391 U+0399 (GREEK CAPITAL LETTER
 ALPHA + GREEK CAPITAL LETTER IOTA).

 Unfortunately, due to historical reasons the characters are misnamed. They
 should be named:

 GREEK TITLECASE LETTER ALPHA WITH PROSGEGRAMMENI, etc.

 However, we can't change the names. See
 http://www.unicode.org/unicode/standard/policies.html. We can add
 annotations.

 Notice that the general category is "Lt" = Titlecase letter, so despite
the
 name the character is the titlecase version. The decomposition is correct
 for that titlecase letter. The full case mapping, as provided in Unidata +
 SpecialCasing is also for the titlecase mapping (see
 http://www.unicode.org/unicode/reports/tr21/ ) You will also find that the
 combining ypogegrammeni cases correctly

 The uppercase mappings in Unidata alone are not sufficient for full case
 mapping, but are the best that can be done without changing string
lengths.
 For the full mapping, you have to use SpecialCasing.txt. You can see what
 results on
 http://www.unicode.org/unicode/reports/tr21/charts/CaseChart4.html (you'll
 need a font for the Greek characters). Search for 1FBC. You will find that
 it is the titlecase form. Some fonts will not show the 1FBC with the right
 iota, but you can see from its position in the chart what it should be.

  However, the precomposed characters containing the prosgegrammeni, e.g.
  "GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI" (u+1FBC) still
 canonically
  decompose to base letter + "COMBINING GREEK YPOGEGRAMMENI" (u+0345), as
if
  prosgegrammeni and ypogegrammeni were the same thing. This means that,
 even

 Those are the right decompositions (see

http://www.unicode.org/unicode/reports/tr15/charts/NormalizationChart17.html
 ), however, because the characters are misnamed it leads to confusion.

 Mark