Re: displaying Unicode text (was Re: Transcriptions of Unicode)
Mark Davis wrote:
> Let's take an example.
> - The page is UTF-8.
> - It contains a mixture of German, dingbats and Hindi text.
> - My locale is de_DE.
> From your description, it sounds like Mozilla works as follows:
> - The locale maps (I'm guessing) to 8859-1
> - 8859 maps to, say, Helvetica.
> - The dingbats and Hindi appear as boxes or question marks.
> This would be pretty lame, so I hope I misunderstand you!!

Sorry, I've been abbreviating quite a bit, so I left out a lot. Yes, you've misunderstood me, but only because I abbreviated so much. Let me try again, with more feeling this time. Using the example above:

- The locale maps to "x-western" (ja_JP would map to "ja"; I've prepended "x-" for the "language groups" that don't exist in RFC 1766).
- x-western and CSS's sans-serif map to Arial.
- The dingbats appear as dingbats if they are in Unicode and at least one of the dingbat fonts on the system has a Unicode cmap subtable (WingDings is a "symbol" font, so it doesn't have such a table). The Hindi might display OK on some Windows systems that have Hindi support (Mozilla itself does not support any Indic languages yet).

We could support the WingDings font by adding an entry for it to the following table:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#872

We just haven't done that yet. Basically, Mozilla will look at all the fonts on the system to find one that contains a glyph for the current character. The language group and user locale stuff that I mentioned earlier is only one part of the process -- the part that deals with the user's font preferences. I'll explain more of the rest of the process.

Mozilla implements CSS2's font matching algorithm:

http://www.w3.org/TR/REC-CSS2/fonts.html#algorithm

This states that *for each character* in the element, the implementation is supposed to go down the list of fonts in the font-family property to find a font that exists and that contains a glyph for the current character.
Mozilla implements this algorithm to the letter, which means that fonts are chosen for each character without regard for neighboring characters (unlike MSIE). This may actually have been a bad decision, since we sometimes end up with text that looks odd due to font changes. Anyway, Mozilla's algorithm has the following steps:

1. "User-Defined" font
2. CSS font-family property
3. CSS generic font (e.g. serif)
4. list of all fonts on system
5. transliteration
6. question mark

You can see these steps in the following pieces of code:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#2642
http://lxr.mozilla.org/seamonkey/source/gfx/src/gtk/nsFontMetricsGTK.cpp#3108

1. "User-Defined" font (FindUserDefinedFont)

We decided to include the User-Defined font functionality in Netscape 6 again. It is similar to the old Netscape 4.X. Basically, if the user selects this encoding from the View menu, then the browser passes the bytes through to the font, untouched. This is for charsets that we don't already support. This step needs to be first, since it overrides everything else.

2. CSS font-family property (FindLocalFont)

If the user hasn't selected User-Defined, we invoke this routine. It simply goes down the font-family list to find a font that exists and that contains a glyph for the current character. E.g.:

    font-family: Arial, "MS Gothic", sans-serif;

3. CSS generic font (FindGenericFont)

If the above fails, this routine tries to find a font for the CSS generic (e.g. sans-serif) that was found in the font-family property, if any; otherwise it falls back to the user's default (serif or sans-serif). This is where the font preferences come in, so this is where we try to determine the language group of the element. I.e. we take the LANG attribute of this element or a parent element, if any; otherwise the language group of the document's charset, if non-Unicode-based; otherwise the user's locale's language group.

4. list of all fonts on system (FindGlobalFont)

If the above fails, this routine goes through all the fonts on the system, trying to find one that contains a glyph for the current character.

5. transliteration (FindSubstituteFont)

If we still can't find a font for this character, we try a transliteration table. For example, the euro is mapped to the 3 ASCII characters "EUR", which is useful on some Unix systems that don't have the euro glyph yet. (Actually, this transliteration step isn't even implemented on Windows yet.)

6. question mark (FindSubstituteFont)

If we can't find a transliteration, we fall back to the last resort -- the good ol' question mark.

That's it. I hope I didn't abbreviate too much this time!

Erik
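The six-step fallback described above can be sketched roughly as follows. This is only an illustrative outline of the chain, not Mozilla's actual code; the font names, the glyph test, and the transliteration table are stand-ins.

```python
# Hedged sketch of the per-character font fallback chain described above.
# Fonts are modeled as name -> set-of-characters; real implementations
# consult the font's cmap table instead.

TRANSLITERATIONS = {"\u20ac": "EUR"}  # step 5: euro -> "EUR"

def choose_font(char, css_families, generic_fonts, all_fonts,
                user_defined=False):
    """Return (font_name_or_None, rendered_text) for one character."""
    def has_glyph(glyphs, ch):
        return ch in glyphs

    # 1. "User-Defined" font: pass the bytes through untouched,
    #    overriding everything else.
    if user_defined:
        return "User-Defined", char
    # 2. CSS font-family property: first listed font that exists and
    #    has a glyph for this character.
    for font in css_families:
        if font in all_fonts and has_glyph(all_fonts[font], char):
            return font, char
    # 3. CSS generic font (e.g. sans-serif), per the user's
    #    language-group font preferences.
    for font in generic_fonts:
        if font in all_fonts and has_glyph(all_fonts[font], char):
            return font, char
    # 4. Every font on the system.
    for name, glyphs in all_fonts.items():
        if has_glyph(glyphs, char):
            return name, char
    # 5. Transliteration table.
    if char in TRANSLITERATIONS:
        return None, TRANSLITERATIONS[char]
    # 6. Last resort: the good ol' question mark.
    return None, "?"
```

Note that, per the CSS2 algorithm, this runs once per character, which is why neighboring characters can end up in different fonts.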
Re: Transcriptions of Unicode
James Kass wrote:
> Erik van der Poel wrote:
> > The font selection is indeed somewhat haphazard for CJK when there are no LANG attributes and the charset doesn't tell us anything either, but then, what do you expect in that situation anyway? I suppose we could deduce that the language is Japanese for Hiragana and Katakana, but what should we do about ideographs? Don't tell me the browser has to start guessing the language for those characters. I've had enough of the guessing game. We have been doing it for charsets for years, and it has led to trouble that we can't back out of now. I think we need to draw the line here, and tell Web page authors to mark their pages with LANG attributes or with particular fonts, preferably in style sheets.
> A Universal Character Set should not require mark-up/tags. If the Japanese version of a Chinese character looks different from the Chinese character, it *is* different. In many cases, "variant" does not mean "same".

I was referring to the CJK Unified Ideographs in the range U+4E00 to U+9FA5. I agree that those codes do not *require* mark-up/tags, but if the author wishes to have them displayed with a "Japanese font", then they must indicate the language or specify the font directly. The latter may be problematic. I don't think it's reasonable to expect a browser to apply various heuristics to determine the language.

> When limited to BMP code points, CJK unification kind of made sense. In light of the new additional planes...

The IRG seems to be doing a fine job. Somehow I get the impression that you have more to say, but you just aren't saying it. Cough it up already. :-)

Erik
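The distinction drawn above (kana can identify Japanese, unified ideographs cannot) comes down to Unicode code-point ranges. A minimal sketch of that kind of test, purely for illustration:

```python
# Illustrative sketch of the script-range test discussed above.
# Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF) imply Japanese,
# but the CJK Unified Ideographs (U+4E00..U+9FA5) are shared between
# Chinese, Japanese and Korean, so no language can be deduced from the
# code point alone -- hence the need for LANG attributes or explicit fonts.

def guess_language(ch):
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:        # Hiragana or Katakana
        return "ja"
    if 0x4E00 <= cp <= 0x9FA5:        # CJK Unified Ideographs: ambiguous
        return None                   # needs markup; don't guess
    return None                       # everything else: no opinion
```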
Re: Transcriptions of Unicode
Mark Davis wrote:
> What wasn't clear from his message is whether Mozilla picks a reasonable font if the language is not there.

Sorry about the lack of clarity. When there is no LANG attribute in the element (or in a parent element), Mozilla uses the document's charset as a fallback. Mozilla has font preferences for each language group. The language groups have been set up to have a one-to-one correspondence with charsets (roughly), e.g. iso-8859-1 - Western, shift_jis - ja. When the charset is a Unicode-based one (e.g. UTF-8), then Mozilla uses the language group that contains the user's locale's language. In other words, Mozilla does not (yet) use the Unicode character codes to select fonts. We may do this in the future.

Erik
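The fallback order described above (element LANG, else the charset's language group, else the user locale's language group) can be sketched like this. The mapping tables are illustrative stand-ins, not Mozilla's actual preference data:

```python
# Hedged sketch of the language-group fallback described above.
# Table contents are examples only.

CHARSET_TO_GROUP = {
    "iso-8859-1": "x-western",
    "shift_jis": "ja",
}
LOCALE_TO_GROUP = {
    "de_DE": "x-western",
    "ja_JP": "ja",
}
UNICODE_CHARSETS = {"utf-8", "utf-16"}

def language_group(lang_attr, charset, user_locale):
    # 1. LANG attribute on the element (or inherited from a parent).
    if lang_attr:
        return lang_attr
    # 2. Non-Unicode charset implies a language group directly.
    cs = charset.lower()
    if cs not in UNICODE_CHARSETS:
        return CHARSET_TO_GROUP.get(cs)
    # 3. Unicode-based charset: fall back to the user's locale's group.
    return LOCALE_TO_GROUP.get(user_locale)
```

The font preferences for the resulting group then drive step 3 (FindGenericFont) of the matching algorithm.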
Re: Japanese characters in Netscape 4.7 (was: embedding in HTML)
Otto Stolz wrote:
> If you have the right fonts, and they display well in IE 5.5 but show empty boxes in Netscape Communicator 4.7, then I suspect you have not properly set up the latter. The font setup procedures differ substantially between IE 5 and NC 4 (I haven't tried NC 6, so far), to an extent that you can easily get the NC settings wrong if you are used to the IE procedures.

By the way, it's called Netscape 6. We dropped the name "Communicator".

> If you select the Japanese font for a Japanese encoding, then only those files that are transferred using JIS encoding will be properly displayed.

...using one of the JIS-based encodings (i.e. Shift_JIS, EUC-JP or ISO-2022-JP).

> In contrast, IE 5.5 lets you choose the font per script, regardless of the transfer encoding used. I think this is more useful in most situations.

Actually, IE lets you choose per script for most scripts, but also allows per-language choices for the Han script (i.e. CJK: Simplified and Traditional Chinese, Japanese and Korean). Netscape 6 has moved towards this too. Instead of per-script, we have per "language group" (although somebody screwed up the UI and misnamed it Language Encoding). The language groups basically map one-to-one to charsets (character encodings), but documents in Unicode-based encodings will be displayed with fonts that have been set for the language group of the locale (if there are no LANG attributes in the document). I.e. we now use the LANG attribute (if any), and fall back to the charset if there is no LANG attribute.

Erik
Re: Japanese characters in Netscape 4.7 (was: embedding in HTML)
Jungshik Shin wrote:
> I guess your diagnosis is correct for NS 4.x for MS-Windows. FYI, NS 6 (and Mozilla) for MS-Windows has more or less the same font selection mechanism as MS IE 4.x/5.x for MS-Windows and Netscape 4.x/6.x/Mozilla for Unix/X11. (I forgot what Netscape 4.x for MacOS does.)

Netscape Navigator/Communicator 4.X for the Mac also performs "font switching" ("font linking" in MS-speak) for documents in Unicode-based encodings, i.e. similar to UnixNav4 and Win16Nav4, but unlike Win32Nav4.

Erik
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Frank da Cruz wrote:
> Doug Ewell wrote:
> > That last paragraph echoes what Frank said about "reversing the layers," performing the UTF-8 conversion first and then looking for escape sequences.
> True UTF-8 support, in terminal emulators and in other software as well, really should depend on UTF-8 conversion being performed first. The irony is, when using ISO 2022 character-set designation and invocation, you have to handle the escape sequences first to know if you're in UTF-8. Therefore, this pushes the burden onto the end user to preconfigure their emulator for UTF-8 if that is what is being used, when ideally this should happen automatically and transparently.

I may be misunderstanding the above, but ISO 2022 says: ESC 2/5 F shall mean that the other coding system uses ESC 2/5 4/0 to return; ESC 2/5 2/15 F shall mean that the other coding system does not use ESC 2/5 4/0 to return (it may have an alternative means to return, or none at all). Registration number 196 is for UTF-8 without implementation level, and its escape sequence is ESC 2/5 4/7.

I believe that ISO 2022 was designed that way so that a decoder that does not know UTF-8 (or any other coding system invoked by ESC 2/5 F) could simply "skip" the octets in that encoding until it gets to the octets ESC 2/5 4/0. This means that it does not need to decode UTF-8 just to find the escape sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters below U+0080 anyway (they're just single-byte ASCII), so it works, no? Of course, if you wanted to include any C1 controls inside the UTF-8 segment, they would have to be encoded in UTF-8, but ESC 2/5 4/0 is entirely in the ASCII range (below 128), so those octets are encoded as is.

Erik
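The "skip" behaviour described above can be sketched in a few lines. The point is that the return sequence ESC 2/5 4/0 (ESC % @, octets 1B 25 40) is pure ASCII, and UTF-8 passes ASCII octets through unchanged (multi-byte sequences use only octets >= 0x80), so a decoder can find the end of the segment without decoding any UTF-8:

```python
# Sketch of skipping an "other coding system" segment per ISO 2022.
# Function names are mine; the escape sequences are from the text above.

ENTER_UTF8 = b"\x1b%G"   # ESC 2/5 4/7, ISO-IR registration 196
RETURN_2022 = b"\x1b%@"  # ESC 2/5 4/0, return to ISO 2022

def skip_other_coding_system(stream, start):
    """Given the offset just past ENTER_UTF8, return the offset just past
    RETURN_2022, without interpreting the bytes in between. Safe because
    ESC (0x1B) can only appear in UTF-8 as the literal ASCII character."""
    end = stream.find(RETURN_2022, start)
    if end == -1:
        raise ValueError("unterminated other-coding-system segment")
    return end + len(RETURN_2022)
```

A decoder that does understand UTF-8 would instead decode the same span; one that doesn't can still resynchronize at the return sequence.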
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Frank da Cruz wrote:
> Yes, but I was thinking more about the ISO 2022 invocation features than the designation ones: LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls. The situation *could* arise where these would be used prior to announcing (or switching to) UTF-8. In this case, the end user would have to configure the software in advance to know whether the incoming byte stream is UTF-8.

Shouldn't the UTF-8 segment switch back to ISO 2022 before invoking any of those C1 controls? This way, the decoder wouldn't have to know UTF-8, and could skip over it reliably.

Erik