Re: Letterforms based on p
I was hoping to find someone who had additional evidence for this character. I happened to come across it the other day in a modern printed edition of 17th- to 19th-century handwritten English letters (Miller, Kerby A., Arnold Schrier, Bruce D. Boling, David N. Doyle. 2002. _Irish immigrants in the land of Canaan: letters and memoirs from colonial and revolutionary America, 1675-1815._ Oxford: Oxford UP). I haven't got it here just now, but if it is important I might be able to provide a few scans. If I remember correctly, it was being used just as a handwritten ligature of the word per, as in per day, per year, etc. Lukas
Re: book end or enclosing characters in most languages?
are they the right way round? so in german it'd be: otto said So, there is not comprehensive list of openers vs. closers possible. Does not look right here. The following is more like it: So, there is not comprehensive list of openers vs. closers possible. No, as far as I can tell the original version is the correct one. Look at it with a font other than Courier New, which has a rather uncommon glyph for the German closing quotes. Times New Roman is much more representative. Lukas
Re: symbols for `born' and `died' + guarani sign
For the married symbol use the mathematical infinity symbol: U+221E (no pun intended). Indeed, one could go a step further and introduce (?) a symbol for divorced: Either one of the following offers itself as a candidate: U+29DC INCOMPLETE INFINITY U+29DE INFINITY NEGATED WITH VERTICAL BAR Actually, the symbol and several others already exist and seem to be standardized. The Duden, the authoritative source on German orthography, describes them in its section on typesetting practices, under the heading of genealogical symbols. Besides the asterisk (born), the dagger/cross (died), and the two overlapping rings (married) it lists: wavy horizontal line (= baptized) a single ring (= engaged) two rings separated by a vertical bar (= U+29DE?) (= divorced) two rings joined by a horizontal line (= extramarital) two swords crossed (= died in combat) rectangle (= buried) urn symbol (= cremated) The married symbol, by the way, typically differs from the infinity symbol, as it consists of two overlapping circles, not just circles touching each other. The born and died symbols, on the other hand, are clearly identical with the normal typographical asterisk and dagger. The Duden also makes it clear that these are all for use in inline text (können in entsprechenden Texten zur Raumersparnis verwendet werden). I haven't got a scanner here, else I might put up a scan somewhere. I haven't found the time yet to look up if any more of these are already in Unicode, but I don't remember having seen them. Lukas
Re: LATIN LETTER N WITH DIAERESIS?
All characters are now mapped to Unicode characters or character sequences where I felt that this was possible. If there are obvious errors, please point them out and I'll update the listing. However, there are some unidentified characters, or ones that could be considered missing from Unicode 4.0, or which have mappings that for one or the other reason could be considered not ideal. These have been highlighted. I welcome suggestions for additions to or subtractions from this list, plus any help anyone could provide in identifying the characters or in locating places they are used. Your F725 Unknown-2, to me, looks like a German SCRIPT CAPITAL S, (compare with U+2112;SCRIPT CAPITAL L). Yes, we were taught to write an S like this in school. Perhaps it's used somewhere in mathematics? Your F7AA Unknown-8 could then be a SCRIPT CAPITAL C. Your F747, spacing left hook below - doesn't it look very much like the palatalization hooks used elsewhere in the list (which you mapped to U+0321)? Your combinations with latin small letter dotless i (e.g. F704, F731, F77A) seem to be designed for use in phonetic transcriptions, and hence are probably intended as IPA U+026A;LATIN LETTER SMALL CAPITAL I. F737: the description in your list doesn't match the glyph shown, which is with triangular colon. F70F Latin small letter a with colon shows a triangular colon glyph and should hence be mapped to U+02D0, not U+003A. F70E Latin small letter a with tilde with modifier letter triangular colon shows a U+0251 Latin small letter alpha glyph. F750 Latin small letter i with palatalized hook below shows an inverted breve glyph, not a hook. F751 Latin small letter i with tilde with tilde shows a macron and a tilde. F754 and F755 Latin small letter J with... show i, not j glyphs. F79B Latin small letter S with retroflex hook below shows not a retroflex hook, but something more like an ogonek. 
A retroflex hook should be attached to the left side of the S, not in the middle below, and has its own precomposed IPA codepoint U+0282. F7AC Latin small letter u with dot below with diaeresis shows an acute, not a diaeresis. Lukas
Re: IPA for hard g
What is the correct IPA symbol for the g sound in gig? Is it U+0067 LATIN SMALL LETTER G [g], or U+0261 LATIN SMALL LETTER SCRIPT G [ɡ]? It seems obvious to me that it should be U+0261, but I'm looking for a voice of authority to confirm this. I'm certainly no voice of authority, but perhaps you will accept as such the _Handbook of the International Phonetic Association_, Cambridge UP, 1999, p.163ff. It states that the glyph represented in the standard IPA chart (opentail g in IPA terminology) is to be encoded as U+0261, but that Ascii g (looptail g) may be used as an equivalent. As far as I can see, almost all professionally printed material that uses IPA symbols (such as dictionaries etc.) uses the opentail g glyph, i.e. U+0261. Lukas
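The distinction between the two code points can be checked directly against the Unicode character database. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper `to_ipa_g` is a hypothetical name introduced here for illustration and is not something the Handbook prescribes.

```python
import unicodedata

# The two candidate code points discussed in the message above.
for ch in ("\u0067", "\u0261"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))


def to_ipa_g(text: str) -> str:
    """Hypothetical helper: replace ASCII g (U+0067) with the IPA g (U+0261)."""
    return text.replace("\u0067", "\u0261")


# "gig" in a broad transcription, typed with ASCII g and U+026A for the vowel.
gig = to_ipa_g("g\u026ag")
print([f"U+{ord(c):04X}" for c in gig])
```

Since the Handbook allows either code point, a substitution like this is a matter of house style rather than correctness.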
Re: Localized names of character ranges
Just a question: has anyone who is concerned about these considered sending the suggestions to someone at Microsoft, where they might do some good? It's nice to tell people on the Unicode list, but to have any impact, Microsoft needs to be involved. True enough. Sorry if I used up bandwidth for people not concerned with this. I was hoping that someone with the right connections would be around here. Not exactly easy for simple Joe User, like me, to find the right address at Microsoft and get listened to, would it be? Lukas
Localized names of character ranges
Hello, I just wondered if anybody at Microsoft has noticed that the names of the Unicode ranges used in German localized editions of MS Office are woefully inadequately translated. It's been a long-standing cause of irritation when working with Word97, and if I remember correctly it hasn't been corrected so far, at least not in Word2000. I'm referring to the names as they are used in the Insert-Symbol dialog. Some of these mistranslations are really far off. To the average user, they will just make no sense at all, but for people on this list they may actually be quite funny. So, just for your enjoyment, here goes: Spacing modifier letters has been translated as if it meant letters that modify the spacing (Buchstaben zur Abstanddefinition). The average user would probably expect to find things like em-space and en-space in that range? Or has somebody succeeded in getting control characters added to Unicode that encode some kind of kerning information? W.O., perhaps? ;-) In a similar vein, Alphabetic presentation forms have come out as characters for alphabetic display. (Zeichen zur alphabetischen Darstellung.) Same goes for the Arabic presentation forms. Less severely, combining diacritical marks have been mistaken for combined diacritical marks (kombinierte diakritische Kennzeichen). What would you expect in such a range, things like Greek Dialytika Tonos, or even precomposed letter combinations? The same confusion about combining/combined goes for the combining characters in the U+20Dx Combining Marks for Symbols range (kombinierte diakritische Sonderzeichen). Also, the for Symbols part has not been rendered at all, and the difference between Sonderzeichen and Kennzeichen will probably not mean anything to the average user. Finally, Georgian has been translated as Georgianisch (perhaps by analogy with Gregorianisch?) instead of correct Georgisch. Is there anybody here who could bring this to the attention of the localization people at MS, if appropriate? 
I'd really hate having to use Buchstaben zur Abstanddefinition for the next 20 years... Just to be constructive, here's my suggestions for a better translation: Spacing modifier letters = Nichtkombinierende Diakritika (I know it's not very precise, but I couldn't come up with anything better) Combining diacritical marks = Kombinierende Diakritika Combining marks for symbols = Kombinierende Symbolzusätze Alphabetic presentation forms = Alphabetische Präsentationsformen Arabic presentation forms = Arabische Präsentationsformen Georgian = Georgisch Lukas
Re: Special characters
Could someone tell me whether it is possible to produce the following characters please? Sure:
k with a small line underneath: &#x1E35; &#7733; LATIN SMALL LETTER K WITH LINE BELOW; or: k&#x0331; k&#817;
K with a small line underneath: &#x1E34; &#7732; LATIN CAPITAL LETTER K WITH LINE BELOW; or: K&#x0331; K&#817;
H with a dot underneath: &#x1E24; &#7716; LATIN CAPITAL LETTER H WITH DOT BELOW; or: H&#x0323; H&#803;
h with a dot underneath: &#x1E25; &#7717; LATIN SMALL LETTER H WITH DOT BELOW; or: h&#x0323; h&#803;
B with a small line underneath: &#x1E06; &#7686; LATIN CAPITAL LETTER B WITH LINE BELOW; or: B&#x0331; B&#817;
b with a small line underneath: &#x1E07; &#7687; LATIN SMALL LETTER B WITH LINE BELOW; or: b&#x0331; b&#817;
D with a small line underneath: &#x1E0E; &#7694; LATIN CAPITAL LETTER D WITH LINE BELOW; or: D&#x0331; D&#817;
d with a small line underneath: &#x1E0F; &#7695; LATIN SMALL LETTER D WITH LINE BELOW; or: d&#x0331; d&#817;
G with a line on top: &#x1E20; &#7712; LATIN CAPITAL LETTER G WITH MACRON; or: G&#x0304; G&#772;
g with a line on top: &#x1E21; &#7713; LATIN SMALL LETTER G WITH MACRON; or: g&#x0304; g&#772;
E with an upside down ^ on top: &#x011A; &#282; LATIN CAPITAL LETTER E WITH CARON*
e with an upside down ^ on top: &#x011B; &#283; LATIN SMALL LETTER E WITH CARON*
Mirror image of a comma, but not at the bottom - should be higher, like an ': &#x02BD; &#701; MODIFIER LETTER REVERSED COMMA (use as a letter); or: &#x201B; &#8219; SINGLE HIGH-REVERSED-9 QUOTATION MARK* (use as a punctuation mark)
The codes marked &#x... are the hexadecimal Unicode values, those marked &#... are the decimal ones. You can use them in this form in html pages. You will need a 'large' Unicode font for most of these - only the ones marked with an * are found, for instance, in standard Windows Unicode fonts such as Times New Roman. I suggest Gentium (http://www.sil.org/~gaultney/gentium). The combining characters U+0304, U+0331 and U+0323 are also in Lucida Sans Unicode and some other fonts. Hope this helps, Lukas
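As a quick check on the list above, here is a small sketch showing that the hexadecimal and decimal references name the same character, and that the base-letter-plus-combining-mark alternative is canonically equivalent to the precomposed one. It assumes the references are written in full HTML form with a leading ampersand, and it uses only Python's standard `html` and `unicodedata` modules.

```python
import html
import unicodedata

# Hexadecimal and decimal numeric character references for the same character.
hex_ref = html.unescape("&#x1E35;")
dec_ref = html.unescape("&#7733;")

# The alternative spelling: base letter k followed by U+0331 COMBINING MACRON BELOW.
combining = html.unescape("k&#x0331;")

print(unicodedata.name(hex_ref))

# Canonical composition (NFC) turns the two-character sequence into the
# precomposed character, so both spellings are equivalent in Unicode terms.
same_after_nfc = unicodedata.normalize("NFC", combining) == hex_ref
print(len(combining), same_after_nfc)
```

Whether the two spellings also look identical on screen still depends on the font, which is why the precomposed forms are the safer choice where they exist.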
Re: Common input methods for IPA
Hi Marc Wilhelm, In this brave new world of wonderful input methods, what is the current state of affairs for keyboard-based input methods for characters from the IPA block? Is there any de facto standard for this and, for that matter, for an IPA keyboard layout? I'm not aware of a standard either (maybe the Keyman keyboards distributed by SIL for their SIL phonetic fonts come closest to one, in terms of widest distribution), but I guess any *international* standard would be highly problematic anyway - any key mnemonics are bound to fail for users that are accustomed to one national keyboard and not another. Personally, I've come to use my own home-grown Keyman method for Unicode IPA. Maybe I am and shall remain the only person in this world who finds the layout intuitive, but it works for me. Since it's based on the layout of a German keyboard, you might find it worth having a look at: http://people.freenet.de/LukasPietsch/Keyman/Keyboards.html Hope this helps, Lukas
Re: Recent Threats
Would you by chance mean 'threads' ? There is a difference, you know ;-) Quite right. And, in order to prove Stefan's point: how about starting a new thread/threat now about why we Germans are so prone to confuse these letters, and what consequences that ought to have for a possible unification of these two characters in Unicode? Any takers? ;-) Lukas
Re: Smiles, faces, etc
Falkor wrote: I was thinking more that this would allow modern software to translate a lower-ASCII three-character sequence into a single unicode emoticon character that would be displayed properly regardless of OS and software, also alleviating the need for such developers to create proprietary artwork for each. This multiple-keystroke-per-character input method does have precedent with Asian languages. I'm starting to wonder about this thread. Really, why would anybody want to have the Ascii-smilies replaced by single standardized faces created by some font designer? The creative process of composing these smilies from their Ascii components, together with the the open-endedness of the repertoire and the scope for creative variation this involves - isn't that just the fun of the whole thing? The playfulness? Isn't it exactly this what has made them so popular? Lukas
Re: A few questions about decomposition, equivalence and rendering
John Cowan wrote: Eh? U+1FC1 *is* nonspacing. The U+1Fxx ones are the spacing compatibility equivalents, except for this one. U+1FC1 is spacing in all the fonts that I've seen. And it decomposes to U+00A8 U+0342 (canonically), i.e. to a sequence of spacing plus non-spacing character. At least it did so in Unicode 3.0. Not that I would bother much - I have no idea where that character should ever be used. Lukas Pietsch
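The decomposition mentioned above can be inspected with Python's standard `unicodedata` module. This is only a sketch: it reports whatever Unicode version ships with the interpreter, which is much newer than Unicode 3.0, and it says nothing about how any particular font draws the character.

```python
import unicodedata

ch = "\u1fc1"  # GREEK DIALYTIKA AND PERISPOMENI

info = {
    "name": unicodedata.name(ch),
    "general_category": unicodedata.category(ch),
    "combining_class": unicodedata.combining(ch),
    "decomposition": unicodedata.decomposition(ch),
}
for key, value in info.items():
    print(f"{key}: {value}")

# Canonical decomposition (NFD) yields a spacing character followed by a
# combining mark; a combining class of 0 on U+1FC1 itself means it is not
# treated as a combining mark.
nfd = unicodedata.normalize("NFD", ch)
print([f"U+{ord(c):04X}" for c in nfd])
```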
Re: Unicode 3.1 and Roman numeral harmonic analysis
Are the letters used in Roman numeral harmonic analysis Roman numerals or are some other letters also used ? There are quite a number of different systems out there, but it's common to them all that they use some combination of Roman letters with numbers (often subscript or superscript), musical accidentals (flat / sharp signs); plus / minus / greater-than or smaller-than signs, and other graphical symbols such as strokes, brackets, circumflex accents... Many systems (including the Schenkerian analysis that is fashionable in the Anglo-Saxon world) use capital Roman numerals as their base symbols while other systems (such as the Riemannian analysis that is common in Germany) use letter combinations that stand for the harmonic functions: T, D, S, Tp, and so forth, including the double-dominant and double-subdominant symbols (partial overlay of two Ds or two Ss respectively.) Do you need a few scans? Lukas
Re: Unicode transliterations (and other operations)
James Kass wrote: Indeed! Or, at least if we need a correct definition of an English word, we should consult an English dictionary. The web page cited by Mr. Constable is simply misleading, unless it were to be amended to clearly state for the purposes of this and related documents... these words mean c. well, the English dictionaries give usages of words in everyday language, and that's fine. But in their usage as technical terms, the distinction between transcription and transliteration (roughly along the lines of the http://www.elot.gr/tc46sc2/purpose.html page) seems to me to be a fairly well-established one, in the field of linguistics at least. No international body has any authority to alter the meaning of existing words in my language or any of our languages. Sure, but we're dealing with a scholarly discipline's technical vocabulary here, and it's not such a bad idea in this case if computer people dealing with language adopt the usage of linguists, is it? what they call transliteration could easily be referred to as reversible transliteration in plain English, without 'breaking existing applications' like my dictionary. You must understand: this isn't about breaking existing applications, it's about a higher-level protocol! ;-) Lukas Pietsch
Re: translation help desired: symbols
Greek = ? symbolo (symbolo) Yes, but don't omit the accent: σύμβολο, plural σύμβολα (oh yes, and this *is* another UTF-8 message again, I couldn't help it.) Lukas
RichEdit v.4 common control in Win98?
Hello, from a recent posting by Peter Constable I take it that it is possible to have Unicode keyboard input with Tavultesoft KeyMan 5.0 (using WM_UNICHAR messages) in some applications under Win98, provided you have version 4 of richedit20.dll installed, and that a new version of Wordpad supports this. I have richedit20 v.3 on my system, which apparently came with IE5.5, and it doesn't provide this functionality. Questions: (a) Is the new richedit control, and/or the new Wordpad version, available for download somewhere? (b) If you install the new version of richedit20.dll, does that actually add the WM_UNICHAR functionality automatically to applications that previously were using richedit20 v.3? (e.g. Outlook Express...?) (c) Has anybody got a list of existing Win98 applications that can make use of the WM_UNICHAR functionality? Thanks, Lukas - Lukas Pietsch University of Freiburg English Department Phone (p.) (#49) (761) 696 37 23 mailto:[EMAIL PROTECTED]
Re: Classical Greek on a Mac
David Perry wrote: I have a polytonic keyboard for Windows that I have created using Keyman, which is not available for the Mac. I'd be happy to share the documentation for this Would you be willing to share the Keyman keyboard itself? I just downloaded Keyman 5.0 after some people on this list told us what wonderful things it can do. But I have no keyboards as yet to go with it. Thanks ever so much, Lukas
Re: Square and lozenge notes -- Musical Notation 3.1 -- Mensural notation
All notes could have been given post-1420 names given the fact that the white notes appear only after 1420... Well, not really, because there are quite a few symbols (black notes of semibreve and above) which occur only in the pre-1420 notation. So the series of "black" note names would have a confusing gap: "black head with no stem" = "black semibrevis" = no "black minima" "black head with stem" = "black semiminima" (new usage) "black head with stem and flag1" = "black fusa" (new usage) "black head with stem and flag2" = "black semifusa" (new usage) "white head with no stem" = "white semibrevis" "white head with stem" = "white minima" "white head with stem and flag1" = "white semiminima" "white head with stem and flag2" = "white fusa" etc. That's what your proposal boils down to, isn't it? Well, certainly historically correct, but I find it even slightly more confusing than the other way. I do think that the terminology Unicode has chosen is the more consistent one. Confusing, yes, but it *will* be confusing to non-specialist users either way, won't it? P.S. Incidentally, do your sources also show consistently the nominal form of the MAXIMA and LONGA with stems pointing downwards contrarily to the Unicode reference glyph ? Oops, indeed, they do, and I hadn't noticed. (As I said, my musicology days at university are way back...) -- This might very well be significant. Yes, I think mensural notation did not have the modern convention that the orientation of the noteheads depends on the position on the stave. Hold on, I'll check. I also notice that the "black maxima" seems to be missing. Since we have the "black" and "white" series, we ought to have them both complete, right? "black longa" can be thought of as unified with Gregorian 1d1d3 "virga", and "black brevis" with generic 1d147 "square notehead black", but the "black maxima" isn't there. Lukas
Re: Square and lozenge notes -- Musical Notation 3.1 -- Mensural notation
In my last posting I wrote: I also notice that the "black maxima" seems to be missing. Since we have the "black" and "white" series, we ought to have them both complete, right? "black longa" can be thought of as unified with Gregorian 1d1d3 "virga", and "black brevis" with generic 1d147 "square notehead black", but the "black maxima" isn't there. Patrick Andries has answered this point, suggesting that the black and white variants should be seen as font variants. I guess that's a valid point, but it raises the question why the other musical notes aren't unified in the same way. There are separate characters (1d1b9) "SEMIBREVIS WHITE" and (1d1ba) "SEMIBREVIS BLACK". Note that these symbols are *not* affected by the semantic ambiguity problem we were discussing, which involves only the smaller note values minima, semiminima, fusa and semifusa. I'd be interested to learn the rationale behind these choices. Is the original proposal available anywhere? As for the other question, that of the stem of "longa" and "maxima": Yes, Patrick's suggestion is right that the most common form of these notes has a downwards stem (on the *right* side of the notehead, mind!) In earlier mensural notation, the directions of noteheads did not depend on the position of the notehead on the stave, as today; rather, minims and other small notes always had upwards stems and single longae and maximae mostly had downward stems. However, the odd example of longae with upward stems can be found even then. From the mid-16th century onwards the modern convention of context-dependent stems seems to have emerged, and from then on both the longae and the minim stems were placed according to it. So, it seems consistent that the Unicode charts show all notes with upward stems, implying that upward and downward stems are context-dependent glyph variants. Plenty of examples of all this can be found in: Willi Apel, Die Notation der polyphonen Musik 900-1600. Leipzig 1962/1970. Lukas
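For readers who want to see which mensural note characters are actually encoded, the following sketch lists the character names in the range U+1D1B6 to U+1D1C0 together with the two characters proposed above as stand-ins for the black longa and black brevis. It assumes the Unicode database bundled with the Python interpreter, and the range bounds are my own choice for illustration.

```python
import unicodedata

# Mensural note characters in the Musical Symbols block (range chosen for
# illustration), plus the two unification candidates named in the message.
mensural = {
    cp: unicodedata.name(chr(cp), "<no name>")
    for cp in range(0x1D1B6, 0x1D1C1)
}
candidates = {
    cp: unicodedata.name(chr(cp), "<no name>")
    for cp in (0x1D1D3, 0x1D147)
}

for cp, name in {**mensural, **candidates}.items():
    print(f"U+{cp:05X} {name}")

# True if any name in the mensural range contains both MAXIMA and BLACK.
has_black_maxima = any(
    "MAXIMA" in name and "BLACK" in name for name in mensural.values()
)
print("black maxima encoded separately:", has_black_maxima)
```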
Re: Latin digraph characters (was: Re: Klingon silliness)
Doug Ewell wrote: Aren't Serbian and Croatian the standard example of two "languages" that are really the same language but are treated separately This question about languages being "really" the same or no turns out to be a rather moot one from a linguist's viewpoint, even more so once the issue gets burdened with national feelings. I mean, are English and Scots the same? Are Bulgarian and Macedonian the same? Are Rumanian and Aromunian the same? Are Ancient Greek and Ancient Macedonian the same? Are Upper German and Lower German the same? Are German, Schwitzerdytsch and Letzeburgsch the same? Are Dutch and Flemish the same? Are British and American English the same (that was an issue at one time!) -- There are probably as many such issues as there are nations in the world, or more, and as a linguist you get weary of getting asked what the "real" answer is in each case. Are there any linguistic or vocabulary differences between them? Well, there are bound to be, at some level, and if not in the normative standards, then in the actual spoken varieties of relevant population centers. The question is just, how big are these, and--different and much more important question--how big do people *want* to *perceive* these differences to be? Lukas (P.S.: Sorry Doug, I meant to send this to the list in the first place.)
Re: Help with Greek special casing
Carl Brown asked: It is final when followed by a hyphen or combining diacritical mark? Patrick Rourke answered: Don't know what the Unicode rules are, but the answer is no. The final sigma form is not used if the sigma is in a medial position in the word but at the end of the line (e.g., when it occurs at the point of hyphenation in a hyphenated word at line end). Just one addition: You do get a final sigma before explicit (hard) hyphens, i.e. U+2010 and other kinds of dashes, as opposed to (soft) line-breaking hyphens (U+00AD). I guess explicit hyphenation isn't likely to occur in typesetting of Ancient Greek, but it does occur in Modern Greek, in noun compounds of the type κράτος-μέλος. The Unicode rules will handle this correctly, as far as I can see. Lukas
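The behaviour described above can be tried out with Python's built-in `str.lower()`, which, as far as I know, applies a final-sigma context rule when lowercasing capital sigma. This is a sketch of one implementation's behaviour, using U+2010 HYPHEN as the explicit hyphen and an unaccented capitalised version of the compound; it is not a statement about every Unicode library.

```python
FINAL_SIGMA = "\u03c2"   # GREEK SMALL LETTER FINAL SIGMA
MEDIAL_SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA

# Compound with an explicit (hard) hyphen, U+2010, written in capitals.
hard = "ΚΡΑΤΟΣ\u2010ΜΕΛΟΣ".lower()

# A single word with a soft hyphen (U+00AD) right after a medial sigma.
soft = "ΘΑΛΑΣ\u00adΣΑ".lower()

# Index 5 is the sigma before the hard hyphen; index 4 is the sigma
# before the soft hyphen.
print(hard, hard[5] == FINAL_SIGMA)
print(soft.replace("\u00ad", "|"), soft[4] == MEDIAL_SIGMA)
```

The soft hyphen is skipped when the context is examined, so the sigma before it stays medial, while the hard hyphen ends the word for casing purposes.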
Re: Inverted breve in Greek?
Seán, these are "perispomeni"s. Not uncommon to see them printed like that. Encode as u+0342. Best wishes, Lukas
Re: What about musical notation?
Am I right in thinking that in the days when hand set metal type on printing presses was the only method of printing that there were fonts of musical type? I have never seen any font of such type myself, though I have seen fonts for such non-text matters as chess sets and crossword puzzles. As far as I know, music printing with mobile letters of this kind was indeed done, mostly back in the 16th/17th century. There were "letters" which each represented one fragment of a stave with one or several noteheads on them. It tended to look pretty rough, though. Almost as if we were to put staves together from ASCII characters like: ---o--- ---| ---| ---| High-quality printing since the mid-18th century has been done by engraving or etching in metal plates, where the graphics are either first drawn by hand on the metal surface, or applied to it with stamps of some sort. Lukas
Re: extracting words
Christopher Fynn wrote: BTW without determining the language as well as the script, how do you propose to determine if a particular string actually matches a word in your "blacklist" (in terms of meaning) or not? The same string of characters might mean completely different things in two languages that share the same script (/Unicode block). This is assuming that what we want is not just a matching of *orthographical* words (character strings), but of *lexicographical* words (aka lexemes). Which of course brings with it even more problems. If you want to filter out all occurrences of, say, a particular verb, you'll have to look out for all possible grammatical forms of that verb. 5 forms at maximum in English (go, goes, went, gone, going), but maybe several hundreds in a heavily inflectional or agglutinative language. In some languages the set of possible forms of a lexeme may even be open-ended. No way of doing that without a full-blown morphological parser (which of course would have to be language-specific.) Looks like this goes a bit beyond what Brahim is planning to do. Lukas
Re: Benefits of Unicode
Francois M Richard wrote: Can Unicode conformance be applied to rtf (and how)? Newer Microsoft products (from Office 97 onwards?) seem to use constructs of the form \uN\'YY to encode Unicode characters, where N is the *decimal* Unicode value and YY is a replacement character in ANSI as an alternative for non-Unicode-aware readers. The rtf source text itself is encoded in 7-bit Ascii, and the codepage used to interpret the \'YY commands is specified somewhere in a command in the header. This is the method apparently used by many Windows applications internally to exchange Unicode data, e.g. through the clipboard. Just save a sample Word document with some Unicode characters to rtf to see how it works. There are more details on this somewhere in the MSDN library, under "Specifications/Applications/Rich Text Format". As for html, you can either embed Unicode character entities of the form &#N; in an otherwise 8-bit source text, or have the whole source text in UTF-8 (This is probably rather over-simplified, I guess... :-) Hope this helps, Lukas
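To make the \u...\'YY construct concrete, here is a minimal sketch of an escaping routine. The function name `rtf_escape` and the choice of a question mark (\'3f) as the ANSI fallback are assumptions made for illustration; the use of signed 16-bit decimal values follows my reading of the RTF specification and should be checked against it before relying on this.

```python
def rtf_escape(text: str, fallback_hex: str = "3f") -> str:
    """Hypothetical helper: escape text for an RTF body.

    ASCII passes through (with RTF's special characters backslash-escaped);
    everything else becomes a decimal \\u control word followed by an ANSI
    fallback of the form \\'YY.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append("\\" + ch if ch in "\\{}" else ch)
            continue
        # Emit one control word per UTF-16 code unit, as a signed 16-bit value.
        data = ch.encode("utf-16-le")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "little")
            if unit > 32767:
                unit -= 65536
            out.append(f"\\u{unit}\\'{fallback_hex}")
    return "".join(out)


sample = rtf_escape("a\u03c3")  # Latin a followed by Greek small sigma
print(sample)
```

A reader that does not understand the \u control word would show the fallback character instead, which is the purpose of the \'YY part.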
Re: Transcriptions of Unicode
Marco Cimarosti wrote: I don't fully agree with Mark Davis' IPA transcription of "Unicode": http://my.ispchannel.com/~markdavis//unicode/Unicode_transcription_images/U_IPA.gif Neither do I, but partly for different reasons. 1) I think that IPA transcriptions should be in [square brackets], while phonemic transcriptions should be in /slashes/. If neither enclosing is present, the transcription is ambiguous. Right. And that's actually part of the key to the problem's answer: 2) AFAIK, the phoneme [o:] (a long version of "o" in "got") does not exist in any standard pronunciation of contemporary English. It should rather be the diphthong [ou] (where the [u] would probably better be U+028A). In America, transcribing the vowel in "code" as /o/ (and "made" as /e/) is not uncommon, at least in *phonemic* transcription. Generally, American accents have less diphthongization in these sounds than British accents have, and phonemically it makes sense to see these sounds as part of the series of "long vowels". A *narrow phonetic* transcription would have something like [u+006F u+028A] for American, and [u+0259 u+028A] for British. 3) The transcription shows the primary stress on the first syllable, and a secondary stress on the last one. In the few occasions when I heard native English speakers saying "Unicode", I had the impression that it rather was the other way round. I can't tell, because where I live I don't get to talk to native speakers about Unicode a lot. But: According to standard word-formation and pronunciation patterns in English, the stress pattern shown ('uni,code) is absolutely what you'd expect: as in "uniform", "unisex", "unicorn", "universe". (D. Jones, English Pronouncing Dictionary, doesn't even mark a secondary stress on the third syllable at all.) 
4) As "Unicode" is the proper name of an international standard, and it is built with two English roots of French origin, it could as well be considered a French word, which would lead to a totally different transcription. Right, but this particular pattern of merging word roots into a new word does suggest English provenance, I think. And, historically, that's where it did come from. But there's another inconsistency in the transcription: the vowels in the first ("u-") and third ("-code") syllable are both phonemically long. Either you put the length mark on both (recommended for *phonetic* transcription), or on neither (okay with *phonemic* transcription). (Of course, if you transcribe the third syllable as a diphthong then you won't get a length mark there.) According to the conventions in D. Jones, English Pronouncing Dictionary, you'd get something like: [u+02C8 u+006A u+0075 u+02D0 u+006E u+026A u+006B u+0259 u+028A u+0064] Lukas - Lukas Pietsch University of Freiburg English Department Phone (p.) (#49) (761) 696 37 23 mailto:[EMAIL PROTECTED]
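The code point sequence given for the Jones-style transcription is easier to judge when rendered as actual characters. A small sketch follows; the helper name `from_codepoints` is hypothetical and assumes the "u+XXXX" notation used in the message.

```python
def from_codepoints(notation: str) -> str:
    """Hypothetical helper: turn 'u+02C8 u+006A ...' into the characters it names."""
    return "".join(chr(int(token[2:], 16)) for token in notation.split())


jones = from_codepoints(
    "u+02C8 u+006A u+0075 u+02D0 u+006E u+026A u+006B u+0259 u+028A u+0064"
)
print(f"[{jones}]")

# The narrow phonetic variants of the vowel in "code" mentioned in the message.
american = from_codepoints("u+006F u+028A")
british = from_codepoints("u+0259 u+028A")
print(american, british)
```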
Re: OT (Kind of): Determining whether Locales are left-to-right or right-to-left.
Michael Kaplan wrote: plus... dumb question 1. Is Aramaic (which doesn't seem to have a 2 character ISO code) the same as Amharic (which does...AM)? If not, Amharic appears to be a Semitic language too, is that written right-to-left too? Amharic uses the Ethiopic script, and is not RTL as far as I know. Aramaic has no native speakers As far as I know, there is still a (small) minority of speakers in Turkey and Syria who speak the present-day descendant language of (biblical) Aramaic. This present-day dialect is commonly called Aramaic too. I have absolutely no idea what writing system, if any, they would use today (although probably not the ancient Aramaic script? More likely Arabic?) Lukas Pietsch
Greek Diacritics Again
Dear all, there's another issue about Greek diacritics I'd like to ask the opinion of the people who are in the know: the question of (monotonic) Greek "TONOS" and (polytonic) Greek "OXIA" and their equivalence. I know this has had a somewhat troublesome history in Unicode. I seem to remember I read in some Unicode document that the Greek "TONOS" could be realized *either* as an acute *or* as a vertical stroke. I can't locate the reference at the moment. Unfortunately I haven't got the book at hand here and I've been searching the website in vain. Is the standard (still) actually saying this, or is my memory failing me? On the other hand, the standard is of course quite unambiguous now about the fact that the two accents are equivalent in principle. All the "Oxia" codepoints in 1fxx are singletons (therefore deprecated?) and canonically map to the corresponding "tonos" codepoints in 03xx. Would it be fair to sum up the consequences of all this for font design in the following way: If a font is designed for use with both monotonic and polytonic Greek, then the "tonos" glyphs should *definitely* look like acutes. If a font is designed for monotonic Greek only, a font designer can choose to use either acutes or verticals (or any other shape, for that matter: decorative typefaces in Greece are apparently using all sorts of things from wedges to dots or squares...) But can you think of any good reason for a font to have different (default) glyphs for the "tonos" and for the "oxia" characters side by side? Lukas Pietsch Ferdinand-Kopf-Str. 11 D-79117 Freiburg Tel. 0761-696 37 23 Universität Freiburg Englisches Seminar
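The singleton mapping mentioned above can be observed directly: normalizing an "oxia" code point from the 1Fxx block yields the corresponding "tonos" code point. A minimal sketch with Python's standard `unicodedata` module, using alpha as the example letter (my choice):

```python
import unicodedata

oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS

print(unicodedata.name(oxia), "->", unicodedata.decomposition(oxia))

# Both NFC and NFD erase the distinction, so a normalizing application
# never preserves the 1Fxx "oxia" code point.
nfc_equal = unicodedata.normalize("NFC", oxia) == tonos
nfd_equal = (
    unicodedata.normalize("NFD", oxia) == unicodedata.normalize("NFD", tonos)
)
print(nfc_equal, nfd_equal)
```

This is the practical reason why a font cannot rely on the two code points staying distinct: text that has passed through normalization will only ever contain the 03xx form.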
Open-Type Support (was: Greek Prosgegrammeni)
Dear all, a lot was said in this thread about intelligent rendering mechanisms, such as fonts implementing automatic glyph substitution and things like that. The notion appears to be quite commonplace to the experts, whereas I (being an amateur) must admit it seemed just like a utopian dream to me when I first heard of the possibility of such a thing, a few months ago. I figure that people are mostly thinking of the technology called "OpenType", is that right? Can anybody enlighten me about how much support for that technology is already available in standard software, say, in browsers or text processors under Windows 9x? If I had a TrueType font that implemented the glyph substitutions, say, for the Greek combining diacritics, could I make my average standard word processing software actually use these features? Or would I have to wait for specialized multilingual word processors to appear on the market? I found the documentation of the "GetCharacterPlacement" function in the Windows API. It looks as if that is the place where these things should be implemented system-wide. But I played with it a bit and found it didn't actually do any glyph replacements. Is that function actually implemented in Win98, or is it just a stub? Or did I make a mistake in my testing, or is something wrong with my system? Can Win2000 do more than Win98 in this respect? I also noticed that MS Internet Explorer does use glyph replacement features on my system when it is displaying Arabic. How does it do that? Would there be a way of making it use other OpenType features too? Lukas Pietsch
Re: Greek Prosgegrammeni
Thanks to Asmus and Kenneth for their clarifying comments. Things are beginning to make sense to me... (:-) In particular, I'm quite relieved to see now that:
- for any one of the common printing variants of mute iota that a user might want to see,
- there is already at least one easily available TrueType font, so that
- even *without* special glyph shaping or glyph substitution mechanisms in display,
- there will be at least one way of encoding that will be stable, in the sense that it will guarantee the desired display and not get corrupted when undergoing canonical composition/decomposition; and, most importantly:
- all these encodings will be recognized as equivalent by Unicode applications when it comes to case-insensitive matching (because all these character sequences case-fold to the same sequence of vowel + small iota (03B9)).

That's something, isn't it? What will *not* work, for most users, is automatic case *conversion*. This will lead to undesired or unexpected results in most cases. But there are other independent reasons for that anyway: For most users, correct uppercasing also involves the stripping of accents and breathings, and the Unicode casing rules don't provide for that either. But then again: who wants to use automatic case conversion for polytonic Greek anyway? (I can hardly remember having ever used it even in the Latin script in all the text processing I've done.) People will simply be typing sequences that Unicode will see as irregular mixed-case strings, but who cares? I guess all the computational features that really matter to most of us common mortals (like sorting, word searches etc.) involve the "case-folding" feature used for case-insensitive matching, and as I said above, this seems to work out in a fairly intuitive and sensible way. So, after all, the UTC people do deserve a pat on the back for their good work? (:-)

I have another ignorant layman's sort of question, but I'll put it into a second message because it really constitutes a different topic. Lukas
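The case-folding claim above lends itself to a quick check. This is a rough sketch only, assuming Python's built-in `str.casefold()` (which applies full case folding) and the `unicodedata` module; the helper name `caseless_key` and the particular list of alpha + mute iota encodings are my own illustrative choices, not anything prescribed in this thread.

```python
import unicodedata

def caseless_key(text):
    """Key for canonical caseless matching: NFD(casefold(NFD(text)))."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

# Illustrative encodings of alpha + mute iota (hypothetical sample set).
VARIANTS = {
    "precomposed small alpha with ypogegrammeni": "\u1FB3",
    "precomposed alpha with prosgegrammeni": "\u1FBC",
    "small alpha + combining ypogegrammeni": "\u03B1\u0345",
    "capital alpha + combining ypogegrammeni": "\u0391\u0345",
    "capital alpha + small iota": "\u0391\u03B9",
    "capital alpha + capital iota": "\u0391\u0399",
}

keys = {name: caseless_key(s) for name, s in VARIANTS.items()}
all_equal = len(set(keys.values())) == 1

for name, key in keys.items():
    # Show each folded key as a list of code points.
    print(f"{name}: {[f'{ord(c):04X}' for c in key]}")
print("all variants match caselessly:", all_equal)
```

The second NFD pass is there because case folding can turn a combining mark (U+0345) into a base letter (U+03B9), which may change the canonical ordering in longer sequences.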
Re: Open-Type Support (was: Greek Prosgegrammeni)
John Hudson wrote:

At present, polytonic Greek is not supported in Uniscribe, I suspect because no one has determined that it needs to be.

So, would you agree that it does need to be? Keeping in mind what Kenneth Whistler wrote:

Not if the fonts they use map capital letter + ypogegrammeni character combinations into capital letter + full-size iota glyph sequences. Of course, if the fonts they use are not designed for correct use with polytonic Greek, then the default rendering behavior of the ypogegrammeni will not be what they expect or want. Time to upgrade the fonts. ... This is not all that sophisticated. It should be a matter that can be wholly encapsulated within the fonts:

                          Font I               Font II
A. 0397 0313 0345   ==    'H iota adscript     'H iota subscript
B. 1F98             ==    'H iota adscript     'H iota subscript

... Many of us have felt all along that polytonic Greek should always be represented decomposed, and that the ELOT polytonic "character" encoding was a dangerous conflation of glyph design and character encoding concerns. ... Implementations that use full decomposition for polytonic Greek and fonts that correctly map the accentual and diacritic combinations are the best bet for consistency *and* good presentation in the long run.

Mind that the case-mapping question we were discussing is just one minor aspect of the issue; the main task is much more general, and at the same time more straightforward (if we leave aside the issue of automatic case conversion and the fancy problems of, let's say, small-caps): the decomposed character sequences simply need to be mapped to the precomposed ones. It affects not only the iota subscripts/adscripts but also all the other diacritics. Without some glyph processing most combinations will never display readably. Since the precomposed glyphs already exist as Unicode codepoints, I suppose that the implementation would probably not even be very difficult, and not much of it would even depend on the individual font, would it?
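At the character level, the mapping from decomposed sequences to precomposed codepoints is what canonical composition (NFC) already defines. Here is a small sketch, assuming Python's standard `unicodedata` module and using the eta example from Kenneth's table; it only illustrates the codepoint mapping and says nothing about how Uniscribe or any font would actually substitute glyphs.

```python
import unicodedata

# Sequence A from the table: capital eta + psili + combining ypogegrammeni.
DECOMPOSED = "\u0397\u0313\u0345"
# Sequence B from the table: the precomposed codepoint.
PRECOMPOSED = "\u1F98"

def compose(text):
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)

def decompose(text):
    """Canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", text)

composed = compose(DECOMPOSED)
decomposed = decompose(PRECOMPOSED)

print("NFC of sequence A:", [f"{ord(c):04X}" for c in composed])
print("NFD of sequence B:", [f"{ord(c):04X}" for c in decomposed])
print("name:", unicodedata.name(PRECOMPOSED))
```

A renderer that composed its input this way before glyph lookup could reuse the precomposed glyphs already present in existing fonts, which is roughly the font-independent step suggested above.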
By the way, I wouldn't agree with Kenneth that it wasn't a good idea to have the precomposed characters in Unicode in the first place. I'm very glad they are there, since, as we see, the beautiful smart rendering features we are talking about are simply not yet available in mainstream text processing software. Much as I like the idea of projects such as "Graphite" that Marco mentioned, I do think there are quite a number of people out here who would love to be able to handle Greek comfortably in their everyday all-purpose text-processing and browsing software. The precomposed characters are at present the only means they have to do so on a Windows platform. Adding smart rendering support for the decomposed characters would provide them with a much better means; I'd certainly agree with Kenneth about that. And I'd also think it would be preferable if that could be done system-wide and not just by some individual application, wouldn't it? So Uniscribe looks like the best bet at the moment for Windows users. What do the Microsoft people think? May we hope? Lukas
Re: Greek Prosgegrammeni
Sorry to go on about this again, but I still feel puzzled, so bear with me once more. I'm not quite sure if Mark's answer solves my problem. I can see that the case mappings and decompositions as defined in the charts are internally free of contradictions, no problem so far. Only, there still seems to be a mismatch between what the charts show and what users will probably expect to see. Let me repeat: as far as I can gather, there are several different typographical traditions, but roughly speaking there are two: In one tradition, readers expect to see full-size, spacing glyphs for mute iotas *both* in titlecase and in uppercase (usually a small iota glyph in titlecase, a small or capital iota glyph in uppercase). In the other tradition, readers expect to see smaller, diacritic-like glyphs (either centered under, or near the right corner of, the base letter), again *both* in titlecase and in uppercase. All the printing I've seen so far seems to adhere either to the one major pattern or the other; they apparently don't often get mixed. And as we've seen, many people who are used to the one pattern aren't even aware that the other exists. The Unicode charts, somewhat arbitrarily, seem to dictate in favour of the one tradition in the one case and of the second tradition in the other. In titlecase you get some sort of non-spacing diacritic, while in uppercase you *must* use the full-size capital iota glyph. Users who want full-size iota glyphs throughout will find it difficult to live with the decomposition to U+0345 in titlecase, while users who want small diacritic glyphs throughout will see no sense in the U+0399 (capital iota) in uppercase. Without some *very* sophisticated rendering machine, neither group will be able to get it all displayed to their taste. People will prefer encoding their texts in ways deviating from the norm, sacrificing case equivalence rather than what each of them will consider "correct" display.
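The asymmetry between the titlecase and uppercase mappings described above can be made concrete. This is just a sketch, assuming Python's built-in string methods (which apply the full case mappings) and the standard `unicodedata` module, with alpha as an arbitrarily chosen sample letter; it shows the codepoints produced, not how any font would draw them.

```python
import unicodedata

SMALL = "\u1FB3"  # GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
TITLE = "\u1FBC"  # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI

def hexes(text):
    """Render a string as a list of four-digit hex code points."""
    return [f"{ord(c):04X}" for c in text]

title_category = unicodedata.category(TITLE)             # general category
title_decomposition = unicodedata.decomposition(TITLE)   # canonical decomposition

titlecased = SMALL.title()                # titlecase form: one codepoint
uppercased = SMALL.upper()                # uppercase form: two codepoints
title_uppercased = TITLE.upper()          # uppercasing the titlecase form
combining_upper = "\u0391\u0345".upper()  # capital alpha + combining ypogegrammeni

print("category of 1FBC:     ", title_category)
print("decomposition of 1FBC:", title_decomposition)
print("title(1FB3):", hexes(titlecased))
print("upper(1FB3):", hexes(uppercased))
print("upper(1FBC):", hexes(title_uppercased))
print("upper(0391 0345):", hexes(combining_upper))
```

So the titlecase form keeps a combining iota (U+0345 after decomposition), whereas every route to uppercase ends in a full capital iota (U+0399).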
Lukas

----- Original Message -----
From: "Mark Davis"
To: "Unicode List"
Sent: Sunday, November 19, 2000 8:18 PM
Subject: Re: Greek Prosgegrammeni

I haven't had time to read this list recently, so here is a somewhat belated response.

But, even if you do so, we are left with a "wrong" canonical decomposition:

1FBC;GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI;Lt;0;L;0391 0345;;;;N;;;;1FB3;

According to James' statement (which is not totally supported by others, anyway), the decomposition should be U+0391 U+0399 (GREEK CAPITAL LETTER ALPHA + GREEK CAPITAL LETTER IOTA).

Unfortunately, due to historical reasons the characters are misnamed. They should be named: GREEK TITLECASE LETTER ALPHA WITH PROSGEGRAMMENI, etc. However, we can't change the names. See http://www.unicode.org/unicode/standard/policies.html. We can add annotations.

Notice that the general category is "Lt" = Titlecase letter, so despite the name the character is the titlecase version. The decomposition is correct for that titlecase letter. The full case mapping, as provided in Unidata + SpecialCasing, is also for the titlecase mapping (see http://www.unicode.org/unicode/reports/tr21/ ). You will also find that the combining ypogegrammeni cases correctly.

The uppercase mappings in Unidata alone are not sufficient for full case mapping, but are the best that can be done without changing string lengths. For the full mapping, you have to use SpecialCasing.txt. You can see what results on http://www.unicode.org/unicode/reports/tr21/charts/CaseChart4.html (you'll need a font for the Greek characters). Search for 1FBC. You will find that it is the titlecase form. Some fonts will not show the 1FBC with the right iota, but you can see from its position in the chart what it should be.

However, the precomposed characters containing the prosgegrammeni, e.g. "GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI" (u+1FBC) still canonically decompose to base letter + "COMBINING GREEK YPOGEGRAMMENI" (u+0345), as if prosgegrammeni and ypogegrammeni were the same thing. This means that, even

Those are the right decompositions (see http://www.unicode.org/unicode/reports/tr15/charts/NormalizationChart17.html ); however, because the characters are misnamed, it leads to confusion.

Mark