Re: displaying Unicode text (was Re: Transcriptions of Unicode)

2000-12-07 Thread Erik van der Poel

Mark Davis wrote:
 
 Let's take an example.
 
 - The page is UTF-8.
 - It contains a mixture of German, dingbats and Hindi text.
 - My locale is de_DE.
 
 From your description, it sounds like Mozilla works as follows:
 
 - The locale maps (I'm guessing) to 8859-1
 - 8859 maps to, say, Helvetica.
 - The dingbats and Hindi appear as boxes or question marks.
 
 This would be pretty lame, so I hope I misunderstand you!!

Sorry, I've been abbreviating quite a bit, so I left out a lot. Yes,
you've misunderstood me, but only because I abbreviated so much. Let me
try again, with more feeling this time.

Using the example above:

- The locale maps to "x-western" (ja_JP would map to "ja", so I've
prepended "x-" for the "language groups" that don't exist in RFC 1766)

- x-western and CSS' sans-serif map to Arial

- The dingbats appear as dingbats if they are in Unicode and at least
one of the dingbat fonts on the system has a Unicode cmap subtable
(WingDings is a "symbol" font, so it doesn't have such a table), while
the Hindi might display OK on some Windows systems if they have Hindi
support (Mozilla itself does not support any Indic languages yet).
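
For the curious, here is a rough sketch of how such a cmap check can be
done on Windows with GDI's GetFontData. The parsing is simplified, and
HasUnicodeCmap is just a name I made up for this message, not what our
code calls it:

  #include <windows.h>
  #include <vector>

  // Check whether the font currently selected into |hdc| has a Unicode
  // cmap subtable (platform 3 = Microsoft, encoding 1 = Unicode BMP).
  // "Symbol" fonts like WingDings use encoding 0 (Symbol) instead.
  bool HasUnicodeCmap(HDC hdc) {
    const DWORD kCmapTag = 0x70616D63;  // 'cmap', byte-reversed for GetFontData
    DWORD size = GetFontData(hdc, kCmapTag, 0, NULL, 0);
    if (size == GDI_ERROR || size < 4) return false;
    std::vector<BYTE> t(size);
    GetFontData(hdc, kCmapTag, 0, &t[0], size);
    // cmap header: version (2 bytes), numTables (2 bytes), then 8-byte
    // records of platformID (2), encodingID (2), offset (4), big-endian.
    unsigned numTables = (t[2] << 8) | t[3];
    for (unsigned i = 0; i < numTables && 4 + i * 8 + 8 <= size; ++i) {
      const BYTE* rec = &t[4 + i * 8];
      unsigned platform = (rec[0] << 8) | rec[1];
      unsigned encoding = (rec[2] << 8) | rec[3];
      if (platform == 3 && encoding == 1) return true;  // Unicode BMP
    }
    return false;
  }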

We could support the WingDings font if we add an entry for WingDings to
the following table:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#872

We just haven't done that yet.

Basically, Mozilla will look at all the fonts on the system to find one
that contains a glyph for the current character.

The language group and user locale stuff that I mentioned earlier is
only one part of the process -- the part that deals with the user's font
preferences. I'll explain more of the rest of the process:

Mozilla implements CSS2's font matching algorithm:

  http://www.w3.org/TR/REC-CSS2/fonts.html#algorithm

This states that *for each character* in the element, the implementation
is supposed to go down the list of fonts in the font-family property, to
find a font that exists and that contains a glyph for the current
character. Mozilla implements this algorithm to the letter, which means
that fonts are chosen for each character without regard for neighboring
characters (unlike MSIE). This may actually have been a bad decision,
since we sometimes end up with text that looks odd due to font changes.

Anyway, Mozilla's algorithm has the following steps:

1. "User-Defined" font
2. CSS font-family property
3. CSS generic font (e.g. serif)
4. list of all fonts on system
5. transliteration
6. question mark

You can see these steps in the following pieces of code:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#2642

http://lxr.mozilla.org/seamonkey/source/gfx/src/gtk/nsFontMetricsGTK.cpp#3108
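
To make the cascade concrete, here is a minimal C++ sketch of the six
steps. The Find* names are the real ones from the code linked above,
but the Font type, the signatures and the stub bodies are
simplifications of mine, not Mozilla's actual interfaces:

  #include <optional>
  #include <string>
  #include <vector>

  // Simplified stand-in for Mozilla's font objects.
  struct Font { std::string name; };

  // Stubbed steps; in Mozilla each one consults real font tables and
  // returns a font only if it actually has a glyph for |ch|.
  std::optional<Font> FindUserDefinedFont(char32_t) { return std::nullopt; }
  std::optional<Font> FindLocalFont(char32_t,
      const std::vector<std::string>&) { return std::nullopt; }
  std::optional<Font> FindGenericFont(char32_t) { return std::nullopt; }
  std::optional<Font> FindGlobalFont(char32_t) { return std::nullopt; }
  std::optional<Font> FindSubstituteFont(char32_t) { return Font{"?"}; }

  // One decision per character, as CSS2's algorithm prescribes.
  Font FindFont(char32_t ch, const std::vector<std::string>& families) {
    if (auto f = FindUserDefinedFont(ch)) return *f;
    if (auto f = FindLocalFont(ch, families)) return *f;
    if (auto f = FindGenericFont(ch)) return *f;
    if (auto f = FindGlobalFont(ch)) return *f;
    return *FindSubstituteFont(ch);  // never fails; worst case is '?'
  }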

1. "User-Defined" font (FindUserDefinedFont)

We decided to include the User-Defined font functionality in Netscape 6
again. It is similar to the old Netscape 4.X. Basically, if the user
selects this encoding from the View menu, then the browser passes the
bytes through to the font, untouched. This is for charsets that we don't
already support. This step needs to be the first step, since it
overrides everything else.

2. CSS font-family property (FindLocalFont)

If the user hasn't selected User-Defined, we invoke this routine. It
simply goes down the font-family list to find a font that exists and
that contains a glyph for the current character. E.g.:

  font-family: Arial, "MS Gothic", sans-serif;
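
In sketch form, with stubs standing in for the real font lookup and
cmap check (LookupFont and HasGlyph are names I made up):

  #include <optional>
  #include <string>
  #include <vector>

  struct Font { std::string name; };  // as in the sketch above

  // Stub: real code asks the OS whether the font is installed.
  Font* LookupFont(const std::string&) { return nullptr; }
  // Stub: real code reads the font's cmap.
  bool HasGlyph(const Font&, char32_t) { return false; }

  // Walk the CSS font-family list until a font exists *and* covers |ch|.
  std::optional<Font> FindLocalFont(char32_t ch,
                                    const std::vector<std::string>& families) {
    for (const std::string& name : families) {
      Font* f = LookupFont(name);
      if (f && HasGlyph(*f, ch)) return *f;
    }
    return std::nullopt;  // fall through to the generic-font step
  }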

3. CSS generic font (FindGenericFont)

If the above fails, this routine tries to find a font for the CSS
generic (e.g. sans-serif) that was found in the font-family property, if
any, otherwise it falls back to the user's default (serif or
sans-serif). This is where the font preferences come in, so this is
where we try to determine the language group of the element. I.e. we
take the LANG attribute of this element (or of a parent element) if
there is one; otherwise the language group of the document's charset,
if it is non-Unicode-based; otherwise the user's locale's language
group.
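
A sketch of that decision chain (the Element type and helper names
below are made up for this message):

  #include <optional>
  #include <string>

  struct Element {
    std::optional<std::string> langAttr;  // LANG attribute, if any
    const Element* parent;                // enclosing element, or null
  };

  // Stub: charset -> language group, null for Unicode-based charsets.
  std::optional<std::string> CharsetLanguageGroup(const std::string&) {
    return std::nullopt;
  }
  // Stub: e.g. de_DE -> "x-western", ja_JP -> "ja".
  std::string LocaleLanguageGroup() { return "x-western"; }

  // LANG attribute (nearest ancestor) beats the charset's group,
  // which beats the locale's group.
  std::string LanguageGroup(const Element* e, const std::string& charset) {
    for (; e; e = e->parent)
      if (e->langAttr) return *e->langAttr;  // really: map language -> group
    if (auto g = CharsetLanguageGroup(charset)) return *g;
    return LocaleLanguageGroup();
  }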

4. list of all fonts on system (FindGlobalFont)

If the above fails, this routine goes through all fonts on the system,
trying to find one that contains a glyph for the current character.
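
On Windows, a rough sketch of such a scan could use GDI's
EnumFontFamiliesEx. HasUnicodeGlyph is an assumed helper, and the
structure here is mine, not the actual code's:

  #include <windows.h>

  // Assumed helper: does this font have a glyph for |ch| (a cmap lookup)?
  bool HasUnicodeGlyph(const LOGFONT& lf, char32_t ch);

  struct ScanState { char32_t ch; LOGFONT found; bool ok; };

  int CALLBACK Scan(const LOGFONT* lf, const TEXTMETRIC*, DWORD, LPARAM p) {
    ScanState* s = reinterpret_cast<ScanState*>(p);
    if (HasUnicodeGlyph(*lf, s->ch)) {
      s->found = *lf;
      s->ok = true;
      return 0;  // stop enumerating
    }
    return 1;  // keep going
  }

  bool FindGlobalFont(HDC hdc, char32_t ch, LOGFONT* out) {
    LOGFONT filter = {};
    filter.lfCharSet = DEFAULT_CHARSET;  // enumerate all charsets
    ScanState s = { ch, {}, false };
    EnumFontFamiliesEx(hdc, &filter, Scan, reinterpret_cast<LPARAM>(&s), 0);
    if (s.ok) *out = s.found;
    return s.ok;
  }

In practice you would want to cache what this learns, of course, since
scanning every font for every character would be far too slow.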

5. transliteration (FindSubstituteFont)

If we still can't find a font for this character, we try a
transliteration table. For example, the euro sign is mapped to the
three ASCII characters "EUR", which is useful on some Unix systems that
don't have a euro glyph yet. Actually, this transliteration step isn't
even implemented on Windows yet.
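
In sketch form (the entries below are illustrative only; the real table
is larger):

  #include <map>
  #include <string>

  const std::map<char32_t, std::string> kTranslit = {
      { U'\u20AC', "EUR" },   // EURO SIGN -> "EUR"
      { U'\u2122', "(TM)" },  // TRADE MARK SIGN
  };

  std::string Transliterate(char32_t ch) {
    auto it = kTranslit.find(ch);
    return it != kTranslit.end() ? it->second : "?";  // step 6 otherwise
  }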

6. question mark (FindSubstituteFont)

If we can't find a transliteration, we fall back to the last resort --
the good ol' question mark.

That's it. I hope I didn't abbreviate too much this time!

Erik



Re: Transcriptions of Unicode

2000-12-06 Thread Erik van der Poel

James Kass wrote:
 
 Erik van der Poel wrote:
 
  The font selection is indeed somewhat haphazard for CJK when there are
  no LANG attributes and the charset doesn't tell us anything either, but
  then, what do you expect in that situation anyway? I suppose we could
  deduce that the language is Japanese for Hiragana and Katakana, but what
  should we do about ideographs? Don't tell me the browser has to start
  guessing the language for those characters. I've had enough of the
  guessing game. We have been doing it for charsets for years, and it has
  led to trouble that we can't back out of now. I think we need to draw
  the line here, and tell Web page authors to mark their pages with LANG
  attributes or with particular fonts, preferably in style sheets.
 
 A Universal Character Set should not require mark-up/tags.
 
 If the Japanese version of a Chinese character looks different
 than the Chinese character, it *is* different.  In many cases,
 "variant" does not mean "same".

I was referring to the CJK Unified Ideographs in the range U+4E00 to
U+9FA5. I agree that those codes do not *require* mark-up/tags, but if
the author wishes to have them displayed with a "Japanese font", then
they must indicate the language or specify the font directly. The latter
may be problematic. I don't think it's reasonable to expect a browser to
apply various heuristics to determine the language.
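
For reference, checking the range itself is trivial; it's guessing the
language that is hard:

  // CJK Unified Ideographs, BMP block (Unicode 3.0): U+4E00..U+9FA5.
  // The code point tells you the script, never the language.
  bool IsCJKUnifiedIdeograph(char32_t ch) {
    return ch >= 0x4E00 && ch <= 0x9FA5;
  }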

 When limited to BMP code points, CJK unification kind of made
 sense.  In light of the new additional planes...
 
 The IRG seems to be doing a fine job.

Somehow I get the impression that you have more to say, but you just
aren't saying it. Cough it up already. :-)

Erik



Re: Transcriptions of Unicode

2000-12-04 Thread Erik van der Poel

Mark Davis wrote:
 
 What wasn't clear from his message
 is whether Mozilla picks a reasonable font if the language is not there.

Sorry about the lack of clarity. When there is no LANG attribute in the
element (or in a parent element), Mozilla uses the document's charset as
a fallback. Mozilla has font preferences for each language group. The
language groups have been set up to have a one-to-one correspondence
with charsets (roughly). E.g. iso-8859-1 -> Western, shift_jis -> ja.
When the charset is a Unicode-based one (e.g. UTF-8), then Mozilla uses
the language group that contains the user's locale's language.

In other words, Mozilla does not (yet) use the Unicode character codes
to select fonts. We may do this in the future.
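
In sketch form, the fallback looks something like this (an illustrative
subset of the mapping; the group names are real Mozilla language
groups):

  #include <map>
  #include <optional>
  #include <string>

  // Rough one-to-one mapping from charset to language group.
  std::optional<std::string> CharsetLanguageGroup(const std::string& charset) {
    static const std::map<std::string, std::string> kGroups = {
        { "iso-8859-1", "x-western" }, { "shift_jis", "ja" },
        { "euc-kr", "ko" }, { "big5", "zh-TW" }, { "gb2312", "zh-CN" },
    };
    auto it = kGroups.find(charset);
    if (it != kGroups.end()) return it->second;
    return std::nullopt;  // Unicode-based: fall back to the locale's group
  }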

Erik



Re: Japanese characters in Netscape 4.7 (was: embedding in HTML)

2000-11-28 Thread Erik van der Poel

Otto Stolz wrote:
 
 If you have the right fonts, and they display well in IE 5.5 but
 show empty boxes in Netscape Communicator 4.7, then I suspect
 you have not properly set up the latter. The font setup procedures
 differ substantially between IE 5 and NC 4 (I haven't tried NC 6,
 so far) to an extent that you easily can get the NC settings wrong
 if you are used to the IE procedures.

By the way, it's called Netscape 6. We dropped the name "Communicator".

 If you select the Japanese font for a Japanese encoding, then only
 those files that are transferred using JIS encoding will be properly
 displayed.

using one of the JIS-based encodings (i.e. Shift_JIS, EUC-JP or
ISO-2022-JP)

 In contrast, IE 5.5 lets you choose the font per script, regardless
 of the transfer encoding used. I think, this is more useful in most
 situations.

Actually, IE lets you choose per script for most scripts, but also
allows per-language choices for the Han script, i.e. CJK (Simplified
and Traditional Chinese, Japanese and Korean).

Netscape 6 has moved towards this too. Instead of per-script, we have
per "language group" (although somebody screwed up the UI and misnamed
it Language Encoding). The language groups basically map one-to-one to
charsets (character encodings), but documents in Unicode-based encodings
will be displayed with fonts that have been set for the language group
of the locale (if there are no LANG attributes in the document). I.e. we
now use the LANG attribute (if any), and fall back to the charset if
there is no LANG attribute.
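
As a sketch of what "per language group" means for the preferences (the
key naming follows the style of our font prefs, e.g. font.name.serif.ja,
but treat the exact strings as illustrative):

  #include <string>

  // Font preferences are keyed by CSS generic and language group,
  // e.g. "font.name.serif.ja" or "font.name.sans-serif.x-western".
  std::string FontPrefKey(const std::string& generic,
                          const std::string& langGroup) {
    return "font.name." + generic + "." + langGroup;
  }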

Erik



Re: Japanese characters in Netscape 4.7 (was: embedding in HTML)

2000-11-28 Thread Erik van der Poel

Jungshik Shin wrote:
 
 I guess your diagnosis is correct for NS 4.x for MS-Windows.  FYI, NS 6
 (and Mozilla) for MS-Windows has more or less the same font selection
 mechanism as MS IE 4.x/5.x for MS-Windows and Netscape 4.x/6.x/Mozilla
 for Unix/X11. (I forgot what Netscape 4.x for MacOS does).

Netscape Navigator/Communicator 4.X for the Mac also performs "font
switching" ("font linking" in MS-speak) for documents in Unicode-based
encodings. I.e. similar to UnixNav4 and Win16Nav4, but unlike Win32Nav4.

Erik



Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-13 Thread Erik van der Poel

Frank da Cruz wrote:
 
 Doug Ewell wrote:
  
  That last paragraph echoes what Frank said about "reversing the layers,"
  performing the UTF-8 conversion first and then looking for escape
  sequences.  True UTF-8 support, in terminal emulators and in other
  software as well, really should depend on UTF-8 conversion being
  performed first.
 
 The irony is, when using ISO 2022 character-set designation and invocation,
 you have to handle the escape sequences first to know if you're in UTF-8.
 Therefore, this pushes the burden onto the end-user to preconfigure their
 emulator for UTF-8 if that is what is being used, when ideally this should
 happen automatically and transparently.

I may be misunderstanding the above, but ISO 2022 says:

  ESC 2/5 F shall mean that the other coding system uses
  ESC 2/5 4/0 to return;

  ESC 2/5 2/15 F shall mean that the other coding system
  does not use ESC 2/5 4/0 to return (it may have an alternative
  means to return or none at all).

Registration number 196 is for UTF-8 without implementation level, and
its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed
that way so that a decoder that does not know UTF-8 (or any other coding
system invoked by ESC 2/5 F) could simply "skip" the octets in that
encoding until it gets to the octets ESC 2/5 4/0.

This means that it does not need to decode UTF-8 just to find the escape
sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters
below U+0080 anyway (they're just single-byte ASCII), so it works, no?

Of course, if you wanted to include any C1 controls inside the UTF-8
segment, they would have to be encoded in UTF-8, but ESC 2/5 4/0 is
entirely in the ASCII range (less than 128), so those octets are encoded
as is.
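
A minimal sketch of such a "skipping" decoder (SkipOtherCodingSystem is
my name for it; this is not from any real implementation):

  #include <cstddef>

  // After seeing ESC 2/5 F (other coding system with standard return),
  // scan raw octets until ESC 2/5 4/0 (ESC % @), which is pure ASCII
  // and therefore passes through UTF-8 unchanged. Returns the index
  // just past the return sequence, or |len| if it never appears.
  std::size_t SkipOtherCodingSystem(const unsigned char* buf,
                                    std::size_t len) {
    for (std::size_t i = 0; i + 2 < len; ++i) {
      if (buf[i] == 0x1B && buf[i + 1] == 0x25 && buf[i + 2] == 0x40)
        return i + 3;  // resume normal ISO 2022 processing here
    }
    return len;  // no return sequence found; skip everything
  }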

Erik



Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-13 Thread Erik van der Poel

Frank da Cruz wrote:
 
 Yes, but I was thinking more about the ISO 2022 invocation features than the
 designation ones:  LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls.
 The situation *could* arise where these would be used prior to announcing
 (or switching to) UTF-8.  In this case, the end-user would have to configure
 the software in advance to know whether the incoming byte stream is UTF-8.

Shouldn't the UTF-8 segment switch back to ISO 2022 before invoking any
of those C1 controls? This way, the decoder wouldn't have to know UTF-8,
and could skip over it reliably.
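
For example, a well-formed stream would look like this (a purely
illustrative byte sequence):

  // C1 invocations stay outside the UTF-8 segment:
  const unsigned char stream[] = {
      0x1B, 0x25, 0x47,   // ESC 2/5 4/7: designate and invoke UTF-8
      0xE2, 0x82, 0xAC,   // U+20AC EURO SIGN, encoded in UTF-8
      0x1B, 0x25, 0x40,   // ESC 2/5 4/0: return to ISO 2022
      0x8E,               // SS2 (C1, 8-bit form), now safe to use
  };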

Erik