Re: displaying Unicode text (was Re: Transcriptions of "Unicode")
Thanks! I appreciate the description. My fears were unfounded.

> This states that *for each character* in the element, the implementation
> is supposed to go down the list of fonts in the font-family property, to
> find a font that exists and that contains a glyph for the current
> character.

I agree that this does not produce optimal results, since one should have the freedom to select different fonts based on the context of a character. The description above is much better than a very coarse-grained approach (such as rendering the entire document or element in a single font), but it needs some more wriggle room to give people flexibility.

Mark

----- Original Message -----
From: "Erik van der Poel" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, December 07, 2000 00:30
Subject: Re: displaying Unicode text (was Re: Transcriptions of "Unicode")
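The per-character matching that Mark quotes naturally produces runs of consecutive characters that resolve to the same font; a context-aware policy of the kind he asks for could then operate on those runs instead of on isolated characters. A minimal sketch, assuming invented font names and coverage sets (this is not Mozilla's code):

```python
# Sketch: CSS2-style per-character font fallback, then grouping the
# result into runs. Font names ("Arial", "Mangal") and their coverage
# sets are illustrative assumptions, not real font data.

def choose_font(ch, families, coverage):
    """Return the first family whose coverage set contains ch, else None."""
    for family in families:
        if ch in coverage.get(family, set()):
            return family
    return None  # a real implementation would fall through to later steps

def font_runs(text, families, coverage):
    """Group consecutive characters that resolved to the same font."""
    runs = []
    for ch in text:
        font = choose_font(ch, families, coverage)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs

coverage = {"Arial": set("Grüße !"), "Mangal": set("नमस्ते ")}
print(font_runs("Grüße नमस्ते", ["Arial", "Mangal"], coverage))
```

A context-sensitive renderer could post-process these runs, e.g. reassigning a very short run to a neighboring run's font when that font also covers it, which would reduce the jarring font changes Mark alludes to.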
Re: displaying Unicode text (was Re: Transcriptions of "Unicode")
Much of what Erik discusses below also appears in my IUC 17 presentation, "International Features of New Netscape 6 Browser"; a paper version is in the proceedings of IUC 17 (see section 4.4). The PowerPoint presentation (outline only) is available at:

http://home.netscape.com/eng/intl/docs/iuc17/browser/iuc17browser.html

- Kat
Re: displaying Unicode text (was Re: Transcriptions of "Unicode")
Mark Davis wrote:
>
> Let's take an example.
>
> - The page is UTF-8.
> - It contains a mixture of German, dingbats and Hindi text.
> - My locale is de_DE.
>
> From your description, it sounds like Mozilla works as follows:
>
> - The locale maps (I'm guessing) to 8859-1
> - 8859 maps to, say, Helvetica.
> - The dingbats and Hindi appear as boxes or question marks.
>
> This would be pretty lame, so I hope I misunderstand you!!

Sorry, I've been abbreviating quite a bit, so I left out a lot. Yes, you've misunderstood me, but only because I abbreviated so much. Let me try again, with more feeling this time.

Using the example above:

- The locale maps to "x-western" (ja_JP would map to "ja", so I've prepended "x-" for the "language groups" that don't exist in RFC 1766)

- x-western and CSS's sans-serif map to Arial

- The dingbats appear as dingbats if they are in Unicode and at least one of the dingbat fonts on the system has a Unicode cmap subtable (WingDings is a "symbol" font, so it doesn't have such a table), while the Hindi might display OK on some Windows systems that have Hindi support (Mozilla itself does not support any Indic languages yet).

We could support the WingDings font if we added an entry for WingDings to the following table:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#872

We just haven't done that yet.

Basically, Mozilla will look at all the fonts on the system to find one that contains a glyph for the current character.

The language group and user locale stuff that I mentioned earlier is only one part of the process -- the part that deals with the user's font preferences.
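The locale-to-language-group mapping Erik describes can be sketched as a small lookup on the locale's language subtag. The table entries below (and the "x-devanagari" and "x-unicode" names) are illustrative assumptions, not Mozilla's actual table:

```python
# Sketch of mapping a POSIX-style locale to a font "language group".
# Per Erik's note, "x-" is prepended for groups that have no RFC 1766
# primary language tag of their own. Table contents are invented.

LANGUAGE_GROUPS = {
    "de": "x-western",    # German -> Western (Latin) group
    "fr": "x-western",
    "ja": "ja",           # Japanese is its own group
    "hi": "x-devanagari", # hypothetical name for a Devanagari group
}

def language_group(locale):
    """Map a locale like 'de_DE' to a font language group."""
    lang = locale.split("_")[0].lower()
    return LANGUAGE_GROUPS.get(lang, "x-unicode")  # fallback name assumed

print(language_group("de_DE"))  # -> x-western
print(language_group("ja_JP"))  # -> ja
```

The language group then selects which set of per-script font preferences (e.g. the user's preferred sans-serif for Western text) is consulted in the generic-font step described below.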
I'll explain more of the rest of the process. Mozilla implements CSS2's font matching algorithm:

http://www.w3.org/TR/REC-CSS2/fonts.html#algorithm

This states that *for each character* in the element, the implementation is supposed to go down the list of fonts in the font-family property, to find a font that exists and that contains a glyph for the current character. Mozilla implements this algorithm to the letter, which means that fonts are chosen for each character without regard for neighboring characters (unlike MSIE). This may actually have been a bad decision, since we sometimes end up with text that looks odd due to font changes.

Anyway, Mozilla's algorithm has the following steps:

1. "User-Defined" font
2. CSS font-family property
3. CSS generic font (e.g. serif)
4. list of all fonts on the system
5. transliteration
6. question mark

You can see these steps in the following pieces of code:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#2642

http://lxr.mozilla.org/seamonkey/source/gfx/src/gtk/nsFontMetricsGTK.cpp#3108

1. "User-Defined" font (FindUserDefinedFont)

We decided to include the User-Defined font functionality in Netscape 6 again. It is similar to the old Netscape 4.x feature. Basically, if the user selects this encoding from the View menu, the browser passes the bytes through to the font, untouched. This is for charsets that we don't already support. This step needs to come first, since it overrides everything else.

2. CSS font-family property (FindLocalFont)

If the user hasn't selected User-Defined, we invoke this routine. It simply goes down the font-family list to find a font that exists and that contains a glyph for the current character. E.g.:

font-family: Arial, "MS Gothic", sans-serif;

3. CSS generic font (FindGenericFont)

If the above fails, this routine tries to find a font for the CSS generic (e.g. sans-serif) that was found in the font-family property, if any; otherwise it falls back to the user's default (serif or sans-serif). This is where the font preferences come in, so this is where we try to determine the language group of the element. I.e. we take the LANG attribute of this element or a parent element, if any; otherwise the language group of the document's charset, if non-Unicode-based; otherwise the user's locale's language group.

4. List of all fonts on the system (FindGlobalFont)

If the above fails, this routine goes through all the fonts on the system, trying to find one that contains a glyph for the current character.

5. Transliteration (FindSubstituteFont)

If we still can't find a font for this character, we try a transliteration table. For example, the euro is mapped to the 3 ASCII characters "EUR", which is useful on some Unix systems that don't have the euro glyph yet. Actually, this transliteration step isn't even implemented on Windows yet.

6. Question mark (FindSubstituteFont)

If we can't find a transliteration, we fall back to the last resort -- the good ol' question mark.

That's it. I hope I didn't abbreviate too much this time!

Erik
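The fallback chain Erik lists can be sketched end to end. This is a toy model under stated assumptions: font coverage is a dict of character sets, step 1 (the User-Defined byte passthrough) is omitted, and the font names and tables are invented, not Mozilla's data:

```python
# Toy model of Mozilla's per-character font fallback (steps 2-6).
# SYSTEM_FONTS, GENERIC_MAP and TRANSLIT are invented illustrations.

SYSTEM_FONTS = {
    "Arial":     set("abcdefghijklmnopqrstuvwxyz "),
    "MS Gothic": set("漢字かな"),
    "Symbolish": {"\u2701"},           # a hypothetical dingbat-only font
}
GENERIC_MAP = {"sans-serif": "Arial"}  # would really vary by language group
TRANSLIT = {"\u20ac": "EUR"}           # euro -> "EUR", as in step 5

def render_char(ch, families, generic="sans-serif"):
    # 2. CSS font-family list (FindLocalFont)
    for fam in families:
        if ch in SYSTEM_FONTS.get(fam, set()):
            return (fam, ch)
    # 3. CSS generic font (FindGenericFont)
    fam = GENERIC_MAP.get(generic)
    if fam and ch in SYSTEM_FONTS[fam]:
        return (fam, ch)
    # 4. every font on the system (FindGlobalFont)
    for fam, cov in SYSTEM_FONTS.items():
        if ch in cov:
            return (fam, ch)
    # 5. transliteration (FindSubstituteFont)
    if ch in TRANSLIT:
        return ("Arial", TRANSLIT[ch])
    # 6. the last-resort question mark
    return ("Arial", "?")

print(render_char("a", ["Arial"]))       # ('Arial', 'a')
print(render_char("漢", ["Arial"]))      # ('MS Gothic', '漢')
print(render_char("\u20ac", ["Arial"]))  # ('Arial', 'EUR')
print(render_char("\u0905", ["Arial"]))  # ('Arial', '?')
```

Note how the per-character structure makes the unlike-MSIE behavior Erik mentions inevitable: each call sees one character, so nothing prevents adjacent characters from landing in different fonts.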
Re: displaying Unicode text (was re: Transcriptions of "Unicode")
John H. Jenkins wrote:

> At 3:57 PM -0800 12/6/00, James Kass wrote:
>
> > A Universal Character Set should not require mark-up/tags.
>
> Au contraire, it's been implicit in the design of Unicode from the
> beginning that markup/tags would be required in certain situations.

Because of the 65,536-character limitation? (Which no longer applies.)

> > If the Japanese version of a Chinese character looks different
> > than the Chinese character, it *is* different. In many cases,
> > "variant" does not mean "same".
>
> But as a rule, the Japanese and Chinese would disagree with you here.
> Certainly the IRG would disagree. Few in the West would argue over
> the fundamental unity of Fraktur and Roman variations of the Latin
> alphabet; most of the Chinese/Japanese variations are on that order
> or less.

As our Asian friends come on-line, they will hopefully contribute to the discussion in this regard. The reason I suspect that the Japanese would tend to agree is that Unicode has not been widely accepted by the Japanese user community. Perhaps if Unicode had originated elsewhere, we would have had to deal with Greek/Latin/Cyrillic unification? (And we could say that since the "W" is really a ligature of two "V"s, it shouldn't have an explicit encoding...)

> > When limited to BMP code points, CJK unification kind of made
> > sense. In light of the new additional planes...
> >
> > The IRG seems to be doing a fine job.
>
> Here you've really lost me. The IRG is unifying in plane 2, as well.
> Nobody in the IRG has suggested that we abandon unification for plane
> 2.

I tried to respond to this in an earlier letter. We don't even have CJK unification in the BMP; witness the blocks U+8A00 to U+8B9F versus U+8BA0 to U+8C36. Many of the characters in the latter block are simplified versions of characters in the former:

U+8A02/U+8BA2
U+8A03/U+8BA3
U+8A0C/U+8BA7
U+8A41/U+8BC2
etc.

Fraktur and roman are both adaptations of the Latin script, or stylistic variations, just as italic and roman are.
The Japanese writing system is Japanese, but derived from Chinese. As you say, some of the differences are minimal -- perhaps a slight variation in stroke order -- but other differences are substantial. In some cases, the Japanese version may use a variant of a certain radical component, or even a different radical. I said I think the IRG is doing a fine job because it faces such a monumental task, much progress is being made, and the results of its work seem to reflect the expectations of the various user communities involved.

Best regards,

James Kass.
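The traditional/simplified pairs James cites can be checked programmatically: each member of a pair is a distinct BMP code point, one in the U+8A00..U+8B9F range and one in the U+8BA0..U+8C36 range. A small sketch (the pair list is exactly the one from the message above):

```python
# The traditional/simplified pairs cited above, as separate code points.
# This only demonstrates that both forms are encoded distinctly in the
# BMP; it makes no claim about how any particular font renders them.

PAIRS = [(0x8A02, 0x8BA2), (0x8A03, 0x8BA3),
         (0x8A0C, 0x8BA7), (0x8A41, 0x8BC2)]

for trad, simp in PAIRS:
    print(f"U+{trad:04X} {chr(trad)}  /  U+{simp:04X} {chr(simp)}")
```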