Re: displaying Unicode text (was Re: Transcriptions of "Unicode")

2000-12-07 Thread Mark Davis

Thanks! I appreciate the description. My fears were unfounded.

> This states that *for each character* in the element, the implementation
> is supposed to go down the list of fonts in the font-family property, to
> find a font that exists and that contains a glyph for the current
> character.

I agree that this does not produce optimal results, since one should
have the freedom to select different fonts based on the context of the
character. The above description is much better than a very coarse-grained
approach (such as setting the entire document or element in the same font),
but it needs some more wiggle room to give implementers flexibility.
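
As a rough illustration of the kind of flexibility I have in mind, a
matcher could prefer a font that covers a whole run of neighboring
characters before falling back to per-character choices. A minimal
sketch in C++ (the Font type, its coverage sets, and the FontForRun
helper are invented for illustration, not anyone's actual
implementation):

  // Hedged sketch of context-aware selection: prefer one font for the
  // whole run of neighboring characters instead of switching per character.
  #include <iostream>
  #include <set>
  #include <string>
  #include <vector>

  struct Font {
      std::string name;
      std::set<char32_t> coverage;   // stands in for the font's cmap
      bool covers(char32_t c) const { return coverage.count(c) != 0; }
  };

  // First font in the list covering every character of the run, or null
  // if no single font does (the caller would then split the run).
  const Font* FontForRun(const std::u32string& run,
                         const std::vector<Font>& fonts) {
      for (const Font& f : fonts) {
          bool coversAll = true;
          for (char32_t c : run)
              if (!f.covers(c)) { coversAll = false; break; }
          if (coversAll) return &f;
      }
      return nullptr;
  }

  int main() {
      std::vector<Font> fonts = {
          {"Arial",     {U'A', U'b', U'!'}},
          {"MS Gothic", {U'A', U'b', U'!', U'\u3042'}},
      };
      // A strictly per-character matcher would pick Arial for "Ab!" and
      // switch fonts for the kana; a run-based matcher keeps the whole
      // run in MS Gothic.
      std::u32string run = U"Ab!\u3042";
      const Font* f = FontForRun(run, fonts);
      std::cout << (f ? f->name : "<split the run>") << "\n";
  }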

Mark

----- Original Message -----
From: "Erik van der Poel" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, December 07, 2000 00:30
Subject: Re: displaying Unicode text (was Re: Transcriptions of "Unicode")


> Mark Davis wrote:
> >
> > Let's take an example.
> >
> > - The page is UTF-8.
> > - It contains a mixture of German, dingbats and Hindi text.
> > - My locale is de_DE.
> >
> > From your description, it sounds like Mozilla works as follows:
> >
> > - The locale maps (I'm guessing) to 8859-1
> > - 8859-1 maps to, say, Helvetica.
> > - The dingbats and Hindi appear as boxes or question marks.
> >
> > This would be pretty lame, so I hope I misunderstand you!!
>
> Sorry, I've been abbreviating quite a bit, so I left out a lot. Yes,
> you've misunderstood me, but only because I abbreviated so much. Sorry.
> Let me try again, with more feeling this time.
>
> Using the example above:
>
> - The locale maps to "x-western" (ja_JP would map to "ja", so I've
> prepended "x-" for the "language groups" that don't exist in RFC 1766)
>
> - x-western and CSS' sans-serif map to Arial
>
> - The dingbats appear as dingbats if they are in Unicode and at least
> one of the dingbat fonts on the system has a Unicode cmap subtable
> (WingDings is a "symbol" font, so it doesn't have such a table), while
> the Hindi might display OK on some Windows systems if they have Hindi
> support (Mozilla itself does not support any Indic languages yet).
>
> We could support the WingDings font if we add an entry for WingDings to
> the following table:
>
>
> http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#872
>
> We just haven't done that yet.
>
> Basically, Mozilla will look at all the fonts on the system to find one
> that contains a glyph for the current character.
>
> The language group and user locale stuff that I mentioned earlier is
> only one part of the process -- the part that deals with the user's font
> preferences. I'll explain more of the rest of the process:
>
> Mozilla implements CSS2's font matching algorithm:
>
>   http://www.w3.org/TR/REC-CSS2/fonts.html#algorithm
>
> This states that *for each character* in the element, the implementation
> is supposed to go down the list of fonts in the font-family property, to
> find a font that exists and that contains a glyph for the current
> character. Mozilla implements this algorithm to the letter, which means
> that fonts are chosen for each character without regard for neighboring
> characters (unlike MSIE). This may actually have been a bad decision,
> since we sometimes end up with text that looks odd due to font changes.
>
> Anyway, Mozilla's algorithm has the following steps:
>
> 1. "User-Defined" font
> 2. CSS font-family property
> 3. CSS generic font (e.g. serif)
> 4. list of all fonts on system
> 5. transliteration
> 6. question mark
>
> You can see these steps in the following pieces of code:
>
>
> http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#2642
>
>
> http://lxr.mozilla.org/seamonkey/source/gfx/src/gtk/nsFontMetricsGTK.cpp#3108
>
> 1. "User-Defined" font (FindUserDefinedFont)
>
> We decided to include the User-Defined font functionality in Netscape 6
> again. It is similar to the old Netscape 4.X. Basically, if the user
> selects this encoding from the View menu, then the browser passes the
> bytes through to the font, untouched. This is for charsets that we don't
> already support. This step needs to be the first step, since it
> overrides everything else.
>
> 2. CSS font-family property (FindLocalFont)
>
> If the user hasn't selected User-Defined, we invoke this routine. It
> simply goes down the font-family list to find a font that exists and
> that contains a glyph for the current character.

Re: displaying Unicode text (was Re: Transcriptions of "Unicode")

2000-12-07 Thread Katsuhiko Momoi

Much of what Erik discusses below is also covered in my IUC 17
presentation, "International Features of New Netscape 6 Browser"; a
paper version appears in the IUC 17 proceedings (see section 4.4).
The PowerPoint presentation (outline only) is available at:

http://home.netscape.com/eng/intl/docs/iuc17/browser/iuc17browser.html
- Kat

Erik van der Poel wrote:

> Mark Davis wrote:
> 
>> Let's take an example.
>> 
>> - The page is UTF-8.
>> - It contains a mixture of German, dingbats and Hindi text.
>> - My locale is de_DE.
>> 
>> From your description, it sounds like Mozilla works as follows:
>> 
>> - The locale maps (I'm guessing) to 8859-1
>> - 8859-1 maps to, say, Helvetica.
>> - The dingbats and Hindi appear as boxes or question marks.
>> 
>> This would be pretty lame, so I hope I misunderstand you!!
> 
> 
> Sorry, I've been abbreviating quite a bit, so I left out a lot. Yes,
> you've misunderstood me, but only because I abbreviated so much. Sorry.
> Let me try again, with more feeling this time.
> 
> Using the example above:
> 
> - The locale maps to "x-western" (ja_JP would map to "ja", so I've
> prepended "x-" for the "language groups" that don't exist in RFC 1766)
> 
> - x-western and CSS' sans-serif map to Arial
> 
> - The dingbats appear as dingbats if they are in Unicode and at least
> one of the dingbat fonts on the system has a Unicode cmap subtable
> (WingDings is a "symbol" font, so it doesn't have such a table), while
> the Hindi might display OK on some Windows systems if they have Hindi
> support (Mozilla itself does not support any Indic languages yet).
> 
> We could support the WingDings font if we add an entry for WingDings to
> the following table:
> 
> http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#872
> 
> We just haven't done that yet.
> 
> Basically, Mozilla will look at all the fonts on the system to find one
> that contains a glyph for the current character.
> 
> The language group and user locale stuff that I mentioned earlier is
> only one part of the process -- the part that deals with the user's font
> preferences. I'll explain more of the rest of the process:
> 
> Mozilla implements CSS2's font matching algorithm:
> 
>   http://www.w3.org/TR/REC-CSS2/fonts.html#algorithm
> 
> This states that *for each character* in the element, the implementation
> is supposed to go down the list of fonts in the font-family property, to
> find a font that exists and that contains a glyph for the current
> character. Mozilla implements this algorithm to the letter, which means
> that fonts are chosen for each character without regard for neighboring
> characters (unlike MSIE). This may actually have been a bad decision,
> since we sometimes end up with text that looks odd due to font changes.
> 
> Anyway, Mozilla's algorithm has the following steps:
> 
> 1. "User-Defined" font
> 2. CSS font-family property
> 3. CSS generic font (e.g. serif)
> 4. list of all fonts on system
> 5. transliteration
> 6. question mark
> 
> You can see these steps in the following pieces of code:
> 
> http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#2642
> 
> http://lxr.mozilla.org/seamonkey/source/gfx/src/gtk/nsFontMetricsGTK.cpp#3108
> 
> 1. "User-Defined" font (FindUserDefinedFont)
> 
> We decided to include the User-Defined font functionality in Netscape 6
> again. It is similar to the old Netscape 4.X. Basically, if the user
> selects this encoding from the View menu, then the browser passes the
> bytes through to the font, untouched. This is for charsets that we don't
> already support. This step needs to be the first step, since it
> overrides everything else.
> 
> 2. CSS font-family property (FindLocalFont)
> 
> If the user hasn't selected User-Defined, we invoke this routine. It
> simply goes down the font-family list to find a font that exists and
> that contains a glyph for the current character. E.g.:
> 
>   font-family: Arial, "MS Gothic", sans-serif;
> 
> 3. CSS generic font (FindGenericFont)
> 
> If the above fails, this routine tries to find a font for the CSS
> generic (e.g. sans-serif) that was found in the font-family property, if
> any, otherwise it falls back to the user's default (serif or
> sans-serif). This is where the font preferences come in, so this is
> where we try to determine the language group of the element. I.e. we
> take the LANG attribute of this element or a parent element if any,
> otherwise the language group of the document's charset, if
> non-Unicode-based, otherwise the user's locale's language group.
> 
> 4. list of all fonts on system (FindGlobalFont)
> 
> If the above fails, this routine goes through all fonts on the system,
> trying to find one that contains a glyph for the current character.
> 
> 5. transliteration (FindSubstituteFont)
> 
> If we still can't find a font for this character, we try a
> transliteration table. For example, the euro is mapped to the 3 ASCIIs
> "EUR".

Re: displaying Unicode text (was Re: Transcriptions of "Unicode")

2000-12-07 Thread Erik van der Poel

Mark Davis wrote:
> 
> Let's take an example.
> 
> - The page is UTF-8.
> - It contains a mixture of German, dingbats and Hindi text.
> - My locale is de_DE.
> 
> From your description, it sounds like Mozilla works as follows:
> 
> - The locale maps (I'm guessing) to 8859-1
> - 8859-1 maps to, say, Helvetica.
> - The dingbats and Hindi appear as boxes or question marks.
> 
> This would be pretty lame, so I hope I misunderstand you!!

Sorry, I've been abbreviating quite a bit, so I left out a lot. Yes,
you've misunderstood me, but only because I abbreviated so much. Sorry.
Let me try again, with more feeling this time.

Using the example above:

- The locale maps to "x-western" (ja_JP would map to "ja", so I've
prepended "x-" for the "language groups" that don't exist in RFC 1766)

- x-western and CSS' sans-serif map to Arial

- The dingbats appear as dingbats if they are in Unicode and at least
one of the dingbat fonts on the system has a Unicode cmap subtable
(WingDings is a "symbol" font, so it doesn't have such a table), while
the Hindi might display OK on some Windows systems if they have Hindi
support (Mozilla itself does not support any Indic languages yet).

We could support the WingDings font if we add an entry for WingDings to
the following table:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#872

We just haven't done that yet.

Basically, Mozilla will look at all the fonts on the system to find one
that contains a glyph for the current character.

The language group and user locale stuff that I mentioned earlier is
only one part of the process -- the part that deals with the user's font
preferences. I'll explain more of the rest of the process:

Mozilla implements CSS2's font matching algorithm:

  http://www.w3.org/TR/REC-CSS2/fonts.html#algorithm

This states that *for each character* in the element, the implementation
is supposed to go down the list of fonts in the font-family property, to
find a font that exists and that contains a glyph for the current
character. Mozilla implements this algorithm to the letter, which means
that fonts are chosen for each character without regard for neighboring
characters (unlike MSIE). This may actually have been a bad decision,
since we sometimes end up with text that looks odd due to font changes.
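
To make the per-character walk concrete, here is a minimal sketch of
that inner loop (this is not the actual nsFontMetricsWin.cpp code; the
Font type and its coverage set merely stand in for a real font's cmap):

  // Hedged sketch: walk the font-family list independently for each
  // character, as CSS2 describes.
  #include <iostream>
  #include <set>
  #include <string>
  #include <vector>

  struct Font {
      std::string name;
      std::set<char32_t> coverage;   // stands in for the font's cmap
      bool hasGlyph(char32_t c) const { return coverage.count(c) != 0; }
  };

  // Font-family walk for one character; nullptr means "fall through to
  // the later steps" (generic font, global scan, and so on).
  const Font* PickFont(char32_t c, const std::vector<Font>& fontFamily) {
      for (const Font& f : fontFamily)
          if (f.hasGlyph(c))
              return &f;
      return nullptr;
  }

  int main() {
      std::vector<Font> fontFamily = {
          {"Arial",     {U'A', U'b'}},
          {"MS Gothic", {U'\u3042'}},    // HIRAGANA LETTER A
      };
      std::u32string text = U"Ab\u3042";
      for (char32_t c : text) {
          // The choice is made per character, ignoring its neighbors.
          const Font* f = PickFont(c, fontFamily);
          std::cout << (f ? f->name : "<later steps>") << "\n";
      }
  }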

Anyway, Mozilla's algorithm has the following steps:

1. "User-Defined" font
2. CSS font-family property
3. CSS generic font (e.g. serif)
4. list of all fonts on system
5. transliteration
6. question mark

You can see these steps in the following pieces of code:

http://lxr.mozilla.org/seamonkey/source/gfx/src/windows/nsFontMetricsWin.cpp#2642

http://lxr.mozilla.org/seamonkey/source/gfx/src/gtk/nsFontMetricsGTK.cpp#3108
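
Collapsed into a hedged sketch, the chain of steps looks roughly as
follows. The function names echo the routines listed above, but the
signatures and stub bodies are invented for illustration; the real
routines take font-metrics and style arguments.

  // Hedged sketch of the six-step fallback chain only; not Mozilla code.
  #include <iostream>
  #include <optional>
  #include <string>

  // Each stub returns the chosen font name, or nullopt to let the next
  // step try. Only FindGlobalFont "succeeds" here, and only for ASCII.
  std::optional<std::string> FindUserDefinedFont(char32_t) { return std::nullopt; }
  std::optional<std::string> FindLocalFont(char32_t)       { return std::nullopt; }
  std::optional<std::string> FindGenericFont(char32_t)     { return std::nullopt; }
  std::optional<std::string> FindGlobalFont(char32_t c) {
      if (c < 0x80) return std::string("Arial");   // pretend ASCII is always covered
      return std::nullopt;
  }
  std::optional<std::string> FindSubstituteFont(char32_t)  { return std::nullopt; }

  std::string FindFontForChar(char32_t c) {
      if (auto f = FindUserDefinedFont(c)) return *f;  // 1. user-defined override
      if (auto f = FindLocalFont(c))       return *f;  // 2. CSS font-family property
      if (auto f = FindGenericFont(c))     return *f;  // 3. CSS generic font
      if (auto f = FindGlobalFont(c))      return *f;  // 4. any font on the system
      if (auto f = FindSubstituteFont(c))  return *f;  // 5. transliteration
      return "?";                                      // 6. question mark
  }

  int main() {
      std::cout << FindFontForChar(U'A') << " "
                << FindFontForChar(U'\u0905') << "\n";  // prints: Arial ?
  }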

1. "User-Defined" font (FindUserDefinedFont)

We decided to include the User-Defined font functionality in Netscape 6
again. It is similar to the old Netscape 4.X. Basically, if the user
selects this encoding from the View menu, then the browser passes the
bytes through to the font, untouched. This is for charsets that we don't
already support. This step needs to be the first step, since it
overrides everything else.

2. CSS font-family property (FindLocalFont)

If the user hasn't selected User-Defined, we invoke this routine. It
simply goes down the font-family list to find a font that exists and
that contains a glyph for the current character. E.g.:

  font-family: Arial, "MS Gothic", sans-serif;

3. CSS generic font (FindGenericFont)

If the above fails, this routine tries to find a font for the CSS
generic (e.g. sans-serif) that was found in the font-family property, if
any, otherwise it falls back to the user's default (serif or
sans-serif). This is where the font preferences come in, so this is
where we try to determine the language group of the element. I.e. we
take the LANG attribute of this element or a parent element if any,
otherwise the language group of the document's charset, if
non-Unicode-based, otherwise the user's locale's language group.
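
A minimal sketch of that lookup order, assuming hypothetical accessors
(Element, LanguageGroupFor, and their parameters are invented names,
not Mozilla APIs):

  // Hedged sketch of the language-group decision: LANG attribute first,
  // then the charset's group (if the charset is non-Unicode-based), then
  // the user's locale's group.
  #include <iostream>
  #include <optional>
  #include <string>

  struct Element {
      std::optional<std::string> lang;   // LANG attribute, possibly inherited
  };

  std::string LanguageGroupFor(const Element& el,
                               const std::optional<std::string>& charsetGroup,
                               const std::string& localeGroup) {
      if (el.lang)      return *el.lang;       // 1. LANG on the element or a parent
      if (charsetGroup) return *charsetGroup;  // 2. charset's group, non-Unicode only
      return localeGroup;                      // 3. user locale, e.g. de_DE -> "x-western"
  }

  int main() {
      Element el;                               // no LANG attribute anywhere
      std::optional<std::string> charsetGroup;  // UTF-8 page: no charset-derived group
      std::cout << LanguageGroupFor(el, charsetGroup, "x-western") << "\n";
  }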

4. list of all fonts on system (FindGlobalFont)

If the above fails, this routine goes through all fonts on the system,
trying to find one that contains a glyph for the current character.

5. transliteration (FindSubstituteFont)

If we still can't find a font for this character, we try a
transliteration table. For example, the euro is mapped to the 3 ASCIIs
"EUR", which is useful on some Unix systems that don't have the euro
glyph yet. Actually, this transliteration step isn't even implemented on
Windows yet.
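
For illustration only, such a table might look like the sketch below.
Only the euro-to-"EUR" entry comes from this discussion; the other
entries and the Transliterate name are invented.

  // Hedged sketch of a last-resort transliteration table.
  #include <iostream>
  #include <map>
  #include <string>

  std::string Transliterate(char32_t c) {
      static const std::map<char32_t, std::string> kTable = {
          {U'\u20AC', "EUR"},   // EURO SIGN (the example above)
          {U'\u00A9', "(c)"},   // COPYRIGHT SIGN (illustrative only)
          {U'\u2122', "(tm)"},  // TRADE MARK SIGN (illustrative only)
      };
      auto it = kTable.find(c);
      return it != kTable.end() ? it->second : "?";  // else fall through to step 6
  }

  int main() {
      std::cout << Transliterate(U'\u20AC') << "\n";  // prints EUR
  }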

6. question mark (FindSubstituteFont)

If we can't find a transliteration, we fall back to the last resort --
the good ol' question mark.

That's it. I hope I didn't abbreviate too much this time!

Erik



Re: displaying Unicode text (was re: Transcriptions of "Unicode")

2000-12-06 Thread James Kass

John H. Jenkins wrote:

> At 3:57 PM -0800 12/6/00, James Kass wrote:
> >A Universal Character Set should not require mark-up/tags.
> 
> Au contraire, it's been implicit in the design of Unicode from the 
> beginning that markup/tags would be required in certain situations. 
>

Because of the 65,536-character limitation? (Which no longer applies.)
 
> >If the Japanese version of a Chinese character looks different
> >than the Chinese character, it *is* different.  In many cases,
> >"variant" does not mean "same".
> 
> But as a rule, the Japanese and Chinese would disagree with you here. 
> Certainly the IRG would disagree.  Few in the west would argue over 
> the fundamental unity of Fraktur and Roman variations of the Latin 
> alphabet; most of the Chinese/Japanese variations are on that order 
> or less.
> 

As our Asian friends come on-line, they will hopefully contribute to the
discussion in this regard. The reason I suspect that the Japanese would
tend to agree with me is that Unicode has not been widely accepted by
the Japanese user community.

Perhaps if Unicode had originated elsewhere, we would have had to deal
with Greek/Latin/Cyrillic unification? (And we could say that since the
"W" is really a ligature of two "V"s, it shouldn't have an explicit
encoding...)

> >
> >When limited to BMP code points, CJK unification kind of made
> >sense.  In light of the new additional planes...
> >
> >The IRG seems to be doing a fine job.
> >
> 
> Here you've really lost me.  The IRG is unifying in plane 2, as well. 
> Nobody in the IRG has suggested that we abandon unification for plane 
> 2.
> 

I tried to respond to this in an earlier letter. We don't even have full
CJK unification in the BMP; witness the blocks U+8A00 to U+8B9F versus
U+8BA0 to U+8C36. Many of the characters in the latter block are
simplified versions of characters in the former.

U+8A02/U+8BA2
U+8A03/U+8BA3
U+8A0C/U+8BA7
U+8A41/U+8BC2
etc.

Fraktur and roman are both adaptations of the Latin script, or stylistic
variations of it, just as italic and roman are. The Japanese writing
system is Japanese, but derived from Chinese. As you say, some of the
differences are minimal, perhaps a slight variation in stroke order, but
other differences are substantial. In some cases, the Japanese version
may use a variant of a certain radical component, or even a different
radical. I said I think the IRG is doing a fine job because the task is
so monumental, much progress is being made, and the results of its work
seem to reflect the expectations of the various user communities
involved.

Best regards,

James Kass.