Switching to UTF-8 and Gnome 1.2.x

2002-05-09 Thread Jungshik Shin


Hi,

In my transition to UTF-8, I found that Gnome 1.2.x has a lot of files
in mixed encodings. All *.desktop files and .directory files are in
mixed encodings: entries for [ja] are in EUC-JP, entries for [de] are
in ISO-8859-1/15, entries for [ru] are in KOI8-R, and so on. On the
other hand, the corresponding KDE files are all in UTF-8, so I didn't
need to change anything there. Anyway, thanks to the Encode module (to
be included in the upcoming Perl 5.8 by default), I was able to write
a simple script that adds a ko_KR.UTF-8 entry for every EUC-KR [ko]
entry in *.desktop files and .directory files. Below is the list of
directories I had to run my script on:

/usr/share/apps
/usr/share/applets
/usr/share/applnk
/etc/X11/applnk
/usr/share/mc
$HOME/.gnome
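For the record, the heart of such a script is tiny. The sketch below is in Python rather than Perl, and the key[ko]=value line pattern is an assumption about the .desktop syntax involved; it recodes each EUC-KR [ko] entry and appends a matching ko_KR.UTF-8 entry:

```python
import re

def add_utf8_entries(data):
    """data: raw bytes of a .desktop/.directory file whose [ko] entries
    are in EUC-KR.  Returns the file with a ko_KR.UTF-8 entry appended
    after each [ko] entry."""
    out = []
    for line in data.split(b"\n"):
        out.append(line)
        m = re.match(rb"^(\w+)\[ko\]=(.*)$", line)
        if m:
            key, value = m.groups()
            # recode the value from EUC-KR to UTF-8 and emit a second entry
            utf8 = value.decode("euc-kr").encode("utf-8")
            out.append(key + b"[ko_KR.UTF-8]=" + utf8)
    return b"\n".join(out)
```

Run over every file under the directories listed above (after making backups), this yields files usable from both EUC-KR and UTF-8 locales.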

Still, I got gibberish in the Gnome tip of the day. It turned out that
the Gnome hint files (usually installed in /usr/share/gnome/hints) are
XML files in mixed encodings. I don't think they comply with the XML
standard, because an XML document cannot mix encodings. So I also had
to add ko_KR.UTF-8 entries for all the [ko] entries there. Even with
this, for some reason unknown to me, I got gibberish whenever I
crossed the 'boundary' (i.e. moved from the last hint to the first or
the other way around).
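The non-compliance is easy to demonstrate: an XML parser decodes the whole file with a single declared encoding (UTF-8 by default), so bytes in a second encoding are simply malformed. A small illustration (the hint text here is made up):

```python
import xml.etree.ElementTree as ET

euc_kr = "\uc548\ub155".encode("euc-kr")  # Korean text as raw EUC-KR bytes

# With no declaration, XML defaults to UTF-8; EUC-KR bytes are not
# valid UTF-8, so the parser rejects the document.
try:
    ET.fromstring(b"<hint>" + euc_kr + b"</hint>")
    parsed = True
except ET.ParseError:
    parsed = False
assert not parsed

# Recoded to UTF-8, the same text parses fine.
tree = ET.fromstring(b"<hint>" + "\uc548\ub155".encode("utf-8") + b"</hint>")
assert tree.text == "\uc548\ub155"
```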

Two other places where languages are tied to encodings are Gnome help
(usually in /usr/share/gnome/help) and the Gimp tips
(/usr/(local/)share/gimp/$version/tips/gimp_tips.[lang].txt). I also
had to make UTF-8 versions of them.

I believe all these problems have been addressed in Gnome 2.0
(RC?/beta), but Gnome 1.x is still widely used. I thought my
experience might help others who want to move to UTF-8, as well as
distribution builders.

  Jungshik Shin

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/




Re: Switching to UTF-8

2002-05-06 Thread Yann Dirson

On Thu, May 02, 2002 at 09:51:44AM +0900, Gaspar Sinai wrote:
 I am not much of an Emacs guy but if I were I would probably
 use QEmacs, which looks pretty decent to me:
 
http://fabrice.bellard.free.fr/qemacs/

I had a quick look at qemacs a couple of weeks ago, for other reasons
(namely docbook support), and found out that this is a project in its early
phases of development, nowhere near a full-blown editor.

-- 
Yann Dirson[EMAIL PROTECTED] |Why make M$-Bill richer  richer ?
Debian-related: [EMAIL PROTECTED] |   Support Debian GNU/Linux:
Pro:[EMAIL PROTECTED] |  Freedom, Power, Stability, Gratuity
 http://ydirson.free.fr/| Check http://www.debian.org/




Re: Switching to UTF-8

2002-05-06 Thread Tomohiro KUBOTA

Hi,

At Mon, 6 May 2002 07:46:33 +0200,
Pablo Saratxaga wrote:

  In Hiragana/Katakana, processing of n is complex (though
  it may be less complex than Hangul).
 
 No. The N is just a kana like any other; no complexity at all is involved.
 Complexity only arises when typing in Latin letters. That is why
 transliteration typing will always require an input method anyway;
 it cannot be handled with just Xkb.

In my sentence above, n is a Latin letter.  It may correspond to
HIRAGANA/KATAKANA LETTER N *or* the first keystroke of n-a, n-i, n-u,
n-e, n-o, n-y-a, n-y-u, or n-y-o.  (The keystrokes n-y-a should give
HIRAGANA/KATAKANA LETTER NI followed by HIRAGANA/KATAKANA LETTER
SMALL YA.)
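A toy converter makes the ambiguity concrete (the mapping table is a deliberately tiny, illustrative subset of a romaji chart, not a real one):

```python
KANA = {"na": "\u306a", "ni": "\u306b", "nya": "\u306b\u3083"}  # na, ni, nya
N = "\u3093"  # the standalone kana "n"

def convert_n(keys):
    """Convert a romaji sequence starting with 'n', or return None
    while the input is still ambiguous."""
    if keys in KANA:
        return KANA[keys]              # a full syllable matched: commit it
    if len(keys) >= 2 and keys[1] not in "aiueoy":
        # 'n' followed by another consonant: commit the standalone
        # kana; the rest starts the next syllable
        return N + keys[1:]
    return None                        # wait for more keystrokes

assert convert_n("n") is None                # cannot decide yet
assert convert_n("ny") is None               # still cannot decide
assert convert_n("nya") == "\u306b\u3083"    # NI followed by SMALL YA
assert convert_n("nk") == N + "k"            # standalone n, then a new syllable
```

This statefulness, waiting on keystrokes already typed, is exactly what a stateless key-to-symbol mapping cannot express.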

Anyway, I understand your point that Latin-to-Hiragana/Katakana
conversion cannot be implemented in xkb.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Switching to UTF-8

2002-05-06 Thread Jungshik Shin




On Mon, 6 May 2002, Pablo Saratxaga wrote:
 On Mon, May 06, 2002 at 10:11:34AM +0900, Tomohiro KUBOTA wrote:

  Note for xkb experts who don't know Hiragana/Katakana/Hangul:
  input methods of these scripts need backtracking.  For example,
  in Hangul, imagine I hit keys in the c-v-c-v (c: consonant,
  v: vowel) sequence.  When I hit c-v-c, it should represent one
  Hangul syllable c-v-c.  However, when I hit the next v, it
  should be two Hangul syllables of c-v c-v.

 That is only the case with 2-mode keyboard; with 3-mode keyboard there
 is no ambiguity, as there are three groups of keys V, C1, C2; allowing
 for all the possible combinations: V-C2, C1-V-C2. Eg: there are two keys

'V-C2 and C1-V-C2' should be 'C1-V' and 'C1-V-C2' :-)

To go all the way with Xkb, even the three-set keyboard layout has to
be modified a little, because some clusters of vowels and consonants
are not assigned separate keys but have to be entered as a sequence of
the keys assigned to basic/simple vowels and consonants. Alternatively,
programs have to be modified to truly support the 'L+V+T*' model of
Hangul syllables, as stipulated in TUS 3.0, p. 53.


 for each consonant: one for the leading consonant of a syllable, and
 one for the ending consonant. (I think the small round glyph that
 fills an empty place in a syllable is always at place C2, that is,
 c-v is always written C1-V-C2 with a special C2 that is not written
 in Latin transliteration)

  You almost got it right, except that IEung ('ㅇ') is NULL at the
syllable-onset position (i.e. it's a placeholder for syllables that
begin with a vowel, and it does not appear in Latin transliteration).
IEung is not NULL at the syllable-coda position; there it corresponds
to [ng] (IPA: [ŋ]) as in 'young'. To put it your way, a V-C2 syllable
is always written as IEung-V-C2, with IEung having no phonetic value.
Here I assume we're not talking about the orthography of the 15th
century ;-)

   Jungshik Shin





Re: Switching to UTF-8

2002-05-05 Thread Tomohiro KUBOTA

Hi,

At 02 May 2002 23:54:37 +1000,
Roger So wrote:

 Note that the source from Li18nux will try to use its own encoding
 conversion mechanisms on Linux, which is broken.  You need to tell it to
 use iconv instead.

I didn't know that, because I am not a user of IIIMF or other Li18nux
products.  How is it broken?


 Maybe I should attempt to package it for Debian again, now that woody is
 almost out of the way.  (I have the full IIIMF stuff working well on my
 development machine.)

I found that Debian has iiimecf package.  Do you know what it is?


 I don't think xkb is sufficient because (1) there's a large number of
 different Chinese input methods out there, and (2) most of the input
 methods require the user to choose from a list of candidates after
 preedit.
 
 I _do_ think xkb is sufficient for Japanese though, if you limit
 Japanese to only hiragana and katakana. ;)

I believe you must be joking about such a limitation. The Japanese
language has far fewer vowels and consonants than Korean, which
results in many more homonyms than in Korean.  Thus, I think
native Japanese speakers won't decide to abolish Kanji.
(Please don't joke like this on an international mailing list,
because people who don't know Japanese may think you are serious.)

Even if we limit input to hiragana/katakana, xkb may not be
sufficient.  For a one-key-one-kana method, I think xkb can be used.
However, more than half of Japanese computer users use Romaji-kana
conversion, a two-keys-one-kana method.  The complexity of that
algorithm is comparable to the two- or three-key Hangul input methods,
I think.  Do you think such an algorithm can be implemented in xkb?
If so, Romaji-kana conversion could be implemented in xkb as well.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/






Re: Switching to UTF-8

2002-05-05 Thread Roger So

On Sun, 2002-05-05 at 21:00, Tomohiro KUBOTA wrote:
 At 02 May 2002 23:54:37 +1000,
 Roger So wrote:
  Note that the source from Li18nux will try to use its own encoding
  conversion mechanisms on Linux, which is broken.  You need to tell it to
  use iconv instead.
 
 I didn't know that, because I am not a user of IIIMF or other Li18nux
 products.  How is it broken?

The csconv library that IIIMF comes with doesn't work properly (at
least I couldn't get it to work), possibly because of endianness
issues.  csconv is meant to be a cross-platform replacement for iconv.
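For comparison, the same job done with plain iconv on the command line (an illustrative one-liner; the EUC-JP bytes 0xA4 0xA2 are HIRAGANA LETTER A):

```shell
# Recode EUC-JP input to UTF-8 with iconv, the tool csconv is meant to
# replace in a cross-platform way.
out=$(printf '\244\242' | iconv -f EUC-JP -t UTF-8)
printf '%s\n' "$out"
```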

  Maybe I should attempt to package it for Debian again, now that woody is
  almost out of the way.  (I have the full IIIMF stuff working well on my
  development machine.)
 
 I found that Debian has iiimecf package.  Do you know what it is?

It's the IIIM Emacs Client Framework.  As the name implies, it's an
implementation of an IIIM client in Emacs.  I've never tried it out, as
I don't use Emacs. :)

Is it used by anyone?  Last time I checked, popularity-contest said
nobody was using it...

  I _do_ think xkb is sufficient for Japanese though, if you limit
  Japanese to only hiragana and katakana. ;)
 
 I believe you must be joking about such a limitation. The Japanese
 language has far fewer vowels and consonants than Korean, which
 results in many more homonyms than in Korean.  Thus, I think
 native Japanese speakers won't decide to abolish Kanji.
 (Please don't joke like this on an international mailing list,
 because people who don't know Japanese may think you are serious.)

Sorry, it wasn't meant to be a serious comment. :)

Cheers

Roger
-- 
  Roger So Debian Developer
  Sun Wah Linux Limitedi18n/L10n Project Leader
  Tel: +852 2250 0230  [EMAIL PROTECTED]
  Fax: +852 2259 9112  http://www.sw-linux.com/




Re: Switching to UTF-8

2002-05-05 Thread Jungshik Shin



On Sun, 5 May 2002, Tomohiro KUBOTA wrote:

 At 02 May 2002 23:54:37 +1000,
 Roger So wrote:

  I _do_ think xkb is sufficient for Japanese though, if you limit
  Japanese to only hiragana and katakana. ;)

 I believe you must be joking about such a limitation. The Japanese
 language has far fewer vowels and consonants than Korean, which
 results in many more homonyms than in Korean.  Thus, I think

  Well, actually it's not so much the difference in the number of
consonants and vowels as the fact that Korean has both closed and open
syllables while Japanese has only open syllables that makes Japanese
have many more homonyms than Korean.

 native Japanese speakers won't decide to abolish Kanji.

  I don't think the Japanese ever will, either.  However, I'm afraid
having too many homonyms is a little too 'feeble' a 'rationale' for
not being able to convert entirely to phonetic scripts like Hiragana
and Katakana. The easiest counter-argument is to ask how Japanese
speakers can tell which homonym is meant in oral communication, if
Kanji is so important for disambiguating homonyms. They don't have any
Kanji to help them (well, sometimes you may have to write down a Kanji
to break the ambiguity in the middle of a conversation, but I guess
that's mostly limited to proper nouns). I have heard that they don't
have much trouble, because context helps a listener a great deal in
figuring out which of many homonyms a speaker means. This is true in
any language. Arguably, the same thing could help readers in written
communication. Of course, using logographic/ideographic characters
like Kanji certainly helps readers very much, and that should be a
very good reason for the Japanese to keep Kanji in their writing
system.

  The English writing system is also 'logographic' in a sense (so is
the modern Korean orthography in pure Hangul, as it departs from
strict agreement between pronunciation and spelling), and a spelling
reform (to give English a degree of agreement between spelling and
pronunciation similar to that of Spanish) would deprive written
English of its 'logographic' nature and make it harder to read. On the
other hand, it would help learners and writers. It has always been a
struggle between readers and writers, and between listeners and
speakers.

 xkb can be used.  However, more than half of Japanese computer
 users use Romaji-kana conversion, two-keys-one-hiragana/katakana
 method.  The complexity of the algorithm is like two or three-key
 input method of Hangul, I think.  Do you think such an algorithm
 can be implemented as xkb?  If yes, I think Romaji-kana conversion
 (whose complexity is like Hangul input method) can be implemented
 as xkb.

  I would also like to know whether it's possible with Xkb.  BTW, if
we use three-set keyboards (where leading consonants and trailing
consonants are assigned separate keys) and use the U+1100 Hangul
conjoining jamos, Korean Hangul input is entirely possible with Xkb
alone.
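The L+V+T composition itself can be verified without any input method at all: under NFC, a keystroke-per-jamo sequence from the U+1100 block collapses into precomposed syllables (a minimal check using Python's unicodedata):

```python
import unicodedata

# Leading consonant (L), vowel (V), trailing consonant (T) from the
# Hangul conjoining-jamo block:
l, v, t = "\u1100", "\u1161", "\u11a8"  # KIYEOK, A, trailing KIYEOK

# Three independent keystrokes normalize into one precomposed syllable.
assert unicodedata.normalize("NFC", l + v + t) == "\uac01"  # U+AC01 GAG

# An open L+V syllable composes too, so a three-set layout emitting
# these jamo needs no backtracking logic of its own.
assert unicodedata.normalize("NFC", l + v) == "\uac00"      # U+AC00 GA
```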

  Jungshik Shin





Re: Switching to UTF-8

2002-05-05 Thread Tomohiro KUBOTA

Hi,

At Sun, 5 May 2002 19:12:31 -0400 (EDT),
Jungshik Shin wrote:

  I believe that you are kidding to say about such a limitation.
  Japanese language has much less vowels and consonants than Korean,
  which results in much more homonyms than Korean.  Thus, I think
 
   Well, actually it's not so much the difference in the number of
 consonants and vowels as the fact that Korean has both closed and
 open syllables while Japanese has only open syllables that makes
 Japanese have many more homonyms than Korean.

You may be right.  Anyway, the real reason is that the Japanese
language has a lot of words from old Chinese.  Words that are not
homonyms in Chinese become homonyms in Japanese.  (They may or may not
be homonyms in Korean.  I believe that Korean also has a lot of
Chinese-origin words.)  Since the way new words are coined is based on
the Kanji system, the Japanese language would lose vitality without
Kanji.

   I don't think Japanese will ever do, either.  However, I'm afraid
 having too many homonyms is a little too 'feeble' a 'rationale' for
 not being able to convert to all phonetic scripts like Hiragana and
 Katakana.
 ...

Since I don't represent the Japanese people, I won't say whether it is
a good idea or not to have many homonyms.  You are right, there are
many other reasons for and against using Kanji, and I cannot explain
everything.

Japanese pronunciation does cause trouble, though it is greatly helped
by accent and rhythm.  However, in some cases neither accent nor
context can help.  For example, both "science" and "chemistry" are
"kagaku" in Japanese.  So we sometimes call chemistry "bakegaku",
where "bake" is another reading of the "ka" of chemistry.  Another
famously confusing pair of words is "private (organization)" and
"municipal (organization)", both called "shiritu".  Thus, "private"
is sometimes called "watakushiritu" and "municipal" is called
"ichiritu"; again, these alias names come from different readings of
the kanji.  If you listen to Japanese news programs every day, you
will come across these examples some day.

These days more and more Japanese people want to learn more Kanji in
order to use its abundant power of expression, though I am not one of
these Kanji learners.


   I would also like to know whether it's possible with Xkb.  BTW, if
 we use three-set keyboards (where leading consonants and trailing
 consonants are assigned separate keys) and use the U+1100 Hangul
 conjoining jamos, Korean Hangul input is entirely possible with Xkb
 alone.

Note for xkb experts who don't know Hiragana/Katakana/Hangul:
input methods of these scripts need backtracking.  For example,
in Hangul, imagine I hit keys in the c-v-c-v (c: consonant,
v: vowel) sequence.  When I hit c-v-c, it should represent one
Hangul syllable c-v-c.  However, when I hit the next v, it
should be two Hangul syllables of c-v c-v. 
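The resyllabification can be sketched abstractly ('C' and 'V' stand for consonant and vowel keystrokes; a real two-set input method works on actual jamo, but the backtracking rule is the one described above):

```python
def syllabify(keys):
    """Group a C/V keystroke string into Hangul-style syllables.  A
    consonant tentatively committed as a final is re-assigned as the
    onset of the next syllable when a vowel follows it."""
    syllables, cur = [], ""
    for k in keys:
        if k == "V" and cur.endswith("C") and "V" in cur:
            # backtrack: the last consonant was an onset, not a final
            syllables.append(cur[:-1])
            cur = cur[-1]
        cur += k
    if cur:
        syllables.append(cur)
    return syllables

assert syllabify("CVC") == ["CVC"]        # one closed syllable
assert syllabify("CVCV") == ["CV", "CV"]  # the fourth key splits the pair
```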

In Hiragana/Katakana, processing of n is complex (though
it may be less complex than Hangul).

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Switching to UTF-8

2002-05-05 Thread Pablo Saratxaga

Kaixo!

On Mon, May 06, 2002 at 10:11:34AM +0900, Tomohiro KUBOTA wrote:

 Note for xkb experts who don't know Hiragana/Katakana/Hangul:
 input methods of these scripts need backtracking.  For example,
 in Hangul, imagine I hit keys in the c-v-c-v (c: consonant,
 v: vowel) sequence.  When I hit c-v-c, it should represent one
 Hangul syllable c-v-c.  However, when I hit the next v, it
 should be two Hangul syllables of c-v c-v. 

That is only the case with 2-mode keyboard; with 3-mode keyboard there
is no ambiguity, as there are three groups of keys V, C1, C2; allowing
for all the possible combinations: V-C2, C1-V-C2. Eg: there are two keys
for each consonant: one for the leading consonant of a syllable, and
one for the ending consonant. (I think the small round glyph that
fills an empty place in a syllable is always at place C2, that is, c-v
is always written C1-V-C2 with a special C2 that is not written in
Latin transliteration)

 In Hiragana/Katakana, processing of n is complex (though
 it may be less complex than Hangul).

No. The N is just a kana like any other; no complexity at all is
involved. Complexity only arises when typing in Latin letters. That is
why transliteration typing will always require an input method anyway;
it cannot be handled with just Xkb.


 

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/PGP Key available, key ID: 0x8F0E4975





Re: Switching to UTF-8

2002-05-02 Thread Glenn Maynard

On Thu, May 02, 2002 at 02:03:06AM -0400, Jungshik Shin wrote:
   I know very little about the Win32 APIs, but according to what
 little I learned from the Mozilla source code, it doesn't seem to be
 as simple on Windows as you wrote, either.  Actually, my impression
 is that the Windows IME APIs are almost parallel (concept-wise) to
 the XIM APIs.  (BTW, MS Windows XP introduced an enhanced set of
 IM-related APIs called TSF?.)  In both cases, you have to determine
 what type of preediting support (in XIM terms: over-the-spot,
 on-the-spot, off-the-spot, and none?) is shared by the client and the
 IM server.  Depending on the preediting type, the amount of work to
 be done by clients varies.

   I'm afraid your impression that Windows IME clients have very
 little to do to get keyboard input comes from your not having written
 programs that accept input from CJK IMEs (input method editors), as
 seems to be confirmed by what I'm quoting below.

I wrote the patch for PuTTY to accept input from Win2K's IME, and some
fixes for Vim's.  What I said is all that's necessary for simple
support, and the vast majority of applications don't need any more than
that.

Of course, what you do with this input is up to the application, and if
you have no support for storing anything but text in the system codepage,
there might be a lot of work to do.  That's a different topic entirely,
of course.

   It just occurred to me that Mozilla.org has an excellent summary
 of input method supports on three major platforms (Unix/X11, MacOS,
 MS-Windows). See
 
   http://www.mozilla.org/projects/intl/input-method-spec.html.

I've never seen any application do anything other than what this
describes as Over-The-Spot composition.  This includes system dialogs,
Word, Notepad and IE.

This document incorrectly says:

"Windows does not use the off-the-spot or over-the-spot styles of input."

As far as I know, Windows uses *only* over-the-spot input.  Perhaps
on-the-spot can be implemented (and most people would probably agree
that it's cosmetically better), but it would probably take a lot more
work.

Ex:
http://zewt.org/~glenn/over1.jpg
http://zewt.org/~glenn/over2.jpg

(The rest of the first half of the document describes input styles
that most programs don't use.)  The document states "Last modified
May 18, 1999", so the information on it is probably out of date.

The only other thing you have to handle is described in Platform
Protocols: WM_IME_COMPOSITION.  The other two messages can be ignored.

The only API function listed here that's often needed is SetCaretPosition,
to set the cursor position.

  It's little enough to add it easily to programs, but the fact that it
  exists at all means that I can't enter CJK into most programs.  Since
  the regular 8-bit character message is in the system codepage, it's
  impossible to send CJK through.
 
   Even in English or any SBCS-based Windows 9x/ME, you
 can write programs that can accept CJK characters from CJK (global)
 IMEs. Mozilla, MS IE, MS Word, and MS OE are good examples.

Yes, you're agreeing with what you quoted.

-- 
Glenn Maynard




Re: Switching to UTF-8

2002-05-02 Thread Tomohiro KUBOTA

Hi,

At Thu, 2 May 2002 02:14:29 -0400 (EDT),
Jungshik Shin wrote:

   You mean IIIMF, didn't you? If there's any actual implementation,
 I'd love to try it out. We need to have Windows 2k/XP or MacOS 9/X
 style keyboard/IM switching mechanism/UI so that  keyboard/IM modules
 targeted at/customized for each language can coexist and be brought up as
 necessary. It appears that IIIMF seems to be the only way unless somebody
 writes a gigantic one-fits-all XIM server for UTF-8 locale(s).

I heard from the Project HEKE people (http://www.kmc.gr.jp/proj/heke/)
that IIIMF has some security problems.  I don't know whether that is
true, nor whether the problem (if any) has been solved.

There _is_ already an implementation of IIIMF.  You can download it
from the Li18nux site.  However, I did not succeed in getting it to
work.  Since I have heard several success reports from IIIMF users, it
is probably simply my fault.

There seems to be some XIM-based implementations which can input
multiple complex languages.

One is ximswitch software in Kondara Linux distribution.
http://www.kondara.org .  I downloaded it but I didn't test it yet.

Another is mlterm (http://mlterm.sourceforge.net/), which is an
entirely client-side solution for switching between multiple XIM
servers.  Though I don't think it is a good idea to require clients to
have such mechanisms, it is so far the only practical way to achieve
multiple-language input.


   How about just running your favorite XIM under ja_JP.EUC-JP while
 all other applications are launched under ja_JP.UTF-8? As you know well,
 it just works fine although the character repertoire you can enter
 is limited to that of EUC-JP. Of course, this is not full-blown UTF-8
 support, but at least it should give you the same degree of Japanese
 input support under ja_JP.UTF-8 as under ja_JP.EUC-JP. Well, then
 you would say what the point of moving to UTF-8 is. You can at least
 display more characters  under UTF-8 than under EUC-JP, can't you? :-)

There is, so far, no conversion engine that requires a character set
beyond EUC-JP; thus, EUC-JP is enough for now.  If someone wants to
develop an input engine that supports more characters, he/she will
want to use UTF-8.  However, I think nobody in Japan feels a strong
need for it, beyond pure technical interest in Unicode itself.


   BTW, Xkb may work for Korean Hangul, too and we don't need
 XIM  if we use 'three-set keyboard' instead of 'two-set keyboard' and can
 live without Hanjas.  I have to know more about Xkb to be certain, though.

I see.  This is not true for Japanese.  Japanese people do need
grammar and context analysis software to get Kanji text.
How about Chinese?


---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: readline (was: Switching to UTF-8)

2002-05-02 Thread Bruno Haible

Markus Kuhn writes:

 There is also bash/readline

SuSE 8.0 ships with a bash/readline that works fine with (at least)
width-1 characters in a UTF-8 locale.

There is also an alpha release of a readline version that attempts to
handle single-width, double-width and zero-width characters in all
multibyte locales. But it's alpha (read: it doesn't work for me yet).

Bruno




Re: Switching to UTF-8

2002-05-02 Thread Roger So

On Thu, 2002-05-02 at 17:11, Tomohiro KUBOTA wrote:
 There _is_ already an implementation of IIIMF.  You can download it
 from the Li18nux site.  However, I did not succeed in getting it to
 work.  Since I have heard several success reports from IIIMF users,
 it is probably simply my fault.

Note that the source from Li18nux will try to use its own encoding
conversion mechanisms on Linux, which is broken.  You need to tell it to
use iconv instead.

Maybe I should attempt to package it for Debian again, now that woody is
almost out of the way.  (I have the full IIIMF stuff working well on my
development machine.)

BTW, Xkb may work for Korean Hangul, too and we don't need
  XIM  if we use 'three-set keyboard' instead of 'two-set keyboard' and can
  live without Hanjas.  I have to know more about Xkb to be certain, though.
 
 I see.  This is not true for Japanese.  Japanese people do need
 grammar and context analysis software to get Kanji text.
 How about Chinese?

I don't think xkb is sufficient because (1) there's a large number of
different Chinese input methods out there, and (2) most of the input
methods require the user to choose from a list of candidates after
preedit.

I _do_ think xkb is sufficient for Japanese though, if you limit
Japanese to only hiragana and katakana. ;)

Regards
Roger
-- 
  Roger So Debian Developer
  Sun Wah Linux Limitedi18n/L10n Project Leader
  Tel: +852 2250 0230  [EMAIL PROTECTED]
  Fax: +852 2259 9112  http://www.sw-linux.com/




Re: readline (was: Switching to UTF-8)

2002-05-02 Thread Markus Kuhn

Bruno Haible wrote on 2002-05-02 12:23 UTC:
 There is also an alpha release of a readline version that attempts to
 handle single-width, double-width and zero-width characters in all
 multibyte locales. But it's alpha (read: it doesn't work for me yet).

Yes, it seems the train is rolling now for UTF-8 support in
bash/readline as well, which is excellent news.

ftp://ftp.cwru.edu/hidden/bash-2.05b-alpha1.tar.gz
ftp://ftp.cwru.edu/hidden/readline-4.3-alpha1.tar.gz

Anyone interested in joining the bash-testers list to help iron out any
problems with UTF-8 support in bash/readline should contact
Chet Ramey [EMAIL PROTECTED].

http://cnswww.cns.cwru.edu/~chet/readline/rltop.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: http://www.cl.cam.ac.uk/~mgk25/





Re: Switching to UTF-8

2002-05-01 Thread Florian Weimer

Markus Kuhn [EMAIL PROTECTED] writes:

   c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
  for my comfort. In particular, I don't like that the UTF-8 mode is not
  binary transparent. Work on turning Emacs completely into a UTF-8
  editor is under way, and I'd be very curious to hear about the
  current status and whether there is anything to test already.
  Anyone?

AFAIK, there is some activity on the Emacs 22 branch.  XEmacs is in
the process of switching to UCS for its internal character set, too.




Re: Switching to UTF-8

2002-05-01 Thread Gaspar Sinai

On Wed, 1 May 2002, Florian Weimer wrote:
 Markus Kuhn [EMAIL PROTECTED] writes:

c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
   for my comfort. In particular, I don't like that the UTF-8 mode is not
   binary transparent. Work on turning Emacs completely into a UTF-8
   editor is under way, and I'd be very curious to hear about the
   current status and whether there is anything to test already.
   Anyone?

 AFAIK, there is some activity on the Emacs 22 branch.  XEmacs is in
 the process of switching to UCS for its internal character set, too.

I am not much of an Emacs guy but if I were I would probably
use QEmacs, which looks pretty decent to me:

   http://fabrice.bellard.free.fr/qemacs/

As I don't use Emacs, I cannot really tell the difference; it might
not have all the functionality that Emacs has. But I have a feeling
that the functionality you would expect from a text editor is there.

I like that QEmacs has a much smaller memory footprint and binary size
than “mainstream” Emacs.

Open Source is funny: you probably will never hear Microsoft
praising Java ☺

Gáspár・ガーシュパール・Гашьпар・갓팔・Γασπαρ
ᏱᎦᏊ ᎣᏌᏂᏳ ᎠᏓᏅᏙ ᎠᏓᏙᎵᎩ ᏂᎪᎯᎸᎢ ᎾᏍᏋ 
ᎤᏠᏯᏍᏗ ᏂᎯ.





Re: Switching to UTF-8

2002-05-01 Thread Tomohiro KUBOTA

Hi,

At Wed, 01 May 2002 20:02:57 +0100,
Markus Kuhn wrote:

 I have for some time now been using UTF-8 more frequently than
 ISO 8859-1. The three critical milestones that still keep me from
 moving entirely to UTF-8 are

How about bash?  Do you know of any improvements?

Please note that tcsh already supports East Asian EUC-like multibyte
encodings.  I don't know whether it also supports UTF-8.

How about zsh?


For Japanese, the character-width problems and the mapping-table
problems must be solved before migration to UTF-8 can even _start_.
(This is why several Japanese localization patches are available for
UTF-8-based software such as Mutt.  We should find ways to do away
with such localization patches.)

Also, I want people who develop UTF-8-based software to adopt the
custom of specifying the range of their UTF-8 support.  For example,

 * range of codepoints
U+0000 - U+2FFF?  all of the BMP?  SMP/SIP?

 * special processing
combining characters?  bidi?  Arabic shaping?  Indic scripts?
Mongolian (which needs vertical writing)?  How about wcwidth()?

 * input methods
Any way to input complex languages which cannot be supported
by the xkb mechanism (i.e., CJK)?  XIM?  IIIMP?  (How about Gnome2?)
Or any software-specific input method (as in Emacs or Yudit)?

 * font availability
   Though each piece of software is not responsible for this, a
   statement like "This software is designed to require the Times
   font" means that it cannot use non-Latin/Greek/Cyrillic characters.
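On the wcwidth() point, a rough approximation shows what is involved (based on Unicode East Asian Width; a sketch for illustration, not a replacement for a real wcwidth() implementation):

```python
import unicodedata

def display_width(s):
    """Approximate terminal columns: combining marks take 0 columns,
    Wide/Fullwidth characters take 2, everything else takes 1."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

assert display_width("abc") == 3
assert display_width("\u65e5\u672c\u8a9e") == 6  # three wide CJK characters
assert display_width("e\u0301") == 1             # e + combining acute accent
```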

Though people in the ISO-8859-1/2/15 regions don't have to care about
these points, other people can easily believe that a piece of software
supports UTF-8 and then be disappointed when they use it.  They will
then come to distrust software that claims UTF-8 support.  We should
avoid letting many people end up like that.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Switching to UTF-8

2002-05-01 Thread Glenn Maynard

On Thu, May 02, 2002 at 11:38:38AM +0900, Tomohiro KUBOTA wrote:
  * input methods
 Any way to input complex languages which cannot be supported
 by xkb mechanism (i.e., CJK) ?  XIM? IIIMP? (How about Gnome2?)
 Or, any software-specific input methods (like Emacs or Yudit)?

How much extra work do X apps currently need to do to support input
methods?

In Windows, you do need to do a little--there's a small API to tell the
input method the cursor position (for when it opens a character selection
box) and to receive characters.  (The former can be omitted and it'll
still be usable, if annoying--the dialog will be at 0x0.  The latter can
be omitted for Unicode-based programs, or if the system codepage happens
to match the characters.)

It's little enough work to add to a program, but the fact that any
work is needed at all means that I can't enter CJK into most programs.
Since the regular 8-bit character message is in the system codepage,
it's impossible to send CJK through it.

How does this compare with the situation in X?

  * fonts availability
 Though each program is not itself responsible for this, "this
 software is designed to require the Times font" means that it
 cannot display non-Latin/Greek/Cyrillic characters.

I can't think of ever using an (untranslated, English) X program and having
it display anything but Latin characters.  When is this actually a problem?

-- 
Glenn Maynard




Re: Switching to UTF-8

2002-05-01 Thread Tomohiro KUBOTA

Hi,

At Thu, 2 May 2002 00:16:25 -0400,
Glenn Maynard wrote:

   * input methods
  Any way to input complex languages which cannot be supported
  by xkb mechanism (i.e., CJK) ?  XIM? IIIMP? (How about Gnome2?)
  Or, any software-specific input methods (like Emacs or Yudit)?
 
 How much extra work do X apps currently need to do to support input
 methods?

A lot of work.  I think this is one problematic point of XIM: only
the very few programs written by developers who know XIM (and such
developers are very few) can accept CJK input.

The X.org distribution (and the XFree86 distribution) includes a
specification of the XIM protocol.  However, it is hard to follow
(at least I could not understand it).  For practical use by developers,
http://www.ainet.or.jp/~inoue/im/index-e.html
is useful for writing XIM clients.  I have not yet found a good
introductory article on writing XIM servers.

I think the low-level API should integrate XIM (or other input
method protocol) support so that developers who know nothing about
XIM (well, almost all developers in the world) can use it and won't
inconvenience CJK users.  Gnome2 seems to take this approach.
However, I wonder why Xlib doesn't have wrapper functions that hide
the troublesome XIM programming.


 It's little enough to add it easily to programs, but the fact that it
 exists at all means that I can't enter CJK into most programs.  Since
 the regular 8-bit character message is in the system codepage, it's
 impossible to send CJK through.

Well, I am talking about Unicode-based software.  More and more
developers around the world are starting to understand that 8 bits
are not enough, because that is a universal fact.  I am optimistic
here; in the near future many developers will consider 8-bit
characters a bad idea.  However, it is unlikely that many developers
will recognize the need for XIM (or other input method) support any
time soon, because it is needed only for CJK languages.  My concern
is how to get these XIM-unaware developers to write CJK-capable
software.


 How does this compare with the situation in X?

Though I don't know Windows programming, I often use Windows for my
work.  Imported software usually cannot handle Japanese because of
font problems.  However, the input method (IME?) does seem to be
invoked even in that imported software.


   * fonts availability
 Though each program is not itself responsible for this, "this
 software is designed to require the Times font" means that it
 cannot display non-Latin/Greek/Cyrillic characters.
 
 I can't think of ever using an (untranslated, English) X program and having
 it display anything but Latin characters.  When is this actually a problem?

For example, XCreateFontSet("-*-times-*") cannot display Japanese
because no Japanese font matches that name.  (Instead, mincho and
gothic are the common Japanese typefaces.)  This type of
implementation is often seen in window managers and their theme
files.

---
Tomohiro KUBOTA [EMAIL PROTECTED]
http://www.debian.or.jp/~kubota/
Introduction to I18N  http://www.debian.org/doc/manuals/intro-i18n/




Re: Switching to UTF-8

2002-05-01 Thread Jungshik Shin




On Thu, 2 May 2002, Glenn Maynard wrote:

 On Thu, May 02, 2002 at 11:38:38AM +0900, Tomohiro KUBOTA wrote:
   * input methods
  Any way to input complex languages which cannot be supported
  by xkb mechanism (i.e., CJK) ?  XIM? IIIMP? (How about Gnome2?)
  Or, any software-specific input methods (like Emacs or Yudit)?

 How much extra work do X apps currently need to do to support input
 methods?

 In Windows, you do need to do a little--there's a small API to tell the
 input method the cursor position (for when it opens a character selection
...
 How does this compare with the situation in X?


  I know very little about the Win32 APIs, but judging from what
little I learned from the Mozilla source code, it doesn't seem to be
so simple on Windows, either.  Actually, my impression is that the
Windows IME APIs are almost parallel (concept-wise) to the XIM APIs.
(BTW, MS Windows XP introduced an enhanced set of IM-related APIs
called TSF?)  In both cases, you have to determine what type of
preediting support (in XIM terms: over-the-spot, on-the-spot,
off-the-spot, and none?) is shared by the client and the IM server.
Depending on the preediting type, the amount of work to be done by
the client varies.


  I'm afraid your impression that Windows IME clients have very
little to do to get keyboard input comes from not having written
programs that accept input from CJK IMEs (input method editors), as
seems to be confirmed by what I'm quoting below.

  It just occurred to me that Mozilla.org has an excellent summary
of input method support on the three major platforms (Unix/X11,
MacOS, MS-Windows).  See

  http://www.mozilla.org/projects/intl/input-method-spec.html.

 It's little enough to add it easily to programs, but the fact that it
 exists at all means that I can't enter CJK into most programs.  Since
 the regular 8-bit character message is in the system codepage, it's
 impossible to send CJK through.

  Even in English (or any SBCS-based) Windows 9x/ME, you can write
programs that accept CJK characters from CJK (global) IMEs.
Mozilla, MS IE, MS Word, and MS OE are good examples.

   Jungshik Shin






Re: Switching to UTF-8

2002-05-01 Thread Jungshik Shin




On Thu, 2 May 2002, Tomohiro KUBOTA wrote:

 At Wed, 01 May 2002 20:02:57 +0100,
 Markus Kuhn wrote:

  I have for some time now been using UTF-8 more frequently than
  ISO 8859-1. The three critical milestones that still keep me from
  moving entirely to UTF-8 are

 How about bash?  Do you know any improvement?

 Please note that tcsh have already supported east Asian EUC-like
 multibyte encodings.  I don't know it also supports UTF-8.

  It doesn't seem to support UTF-8 locales as of tcsh 6.10.0
(2000-11-19).  I can't find anything about UTF-8 at
http://www.tcsh.org; the newest release is 6.11.0.  The same is true
of zsh (http://www.zsh.org).

 combining characters?  bidi?  Arab shaping?  Indic scripts?
   and Hangul :-)
 Mongol (which needs vertical direction)?  How about wcwidth()?

  Pango and ST should certainly help here.

  * input methods
 Any way to input complex languages which cannot be supported
 by xkb mechanism (i.e., CJK) ?  XIM? IIIMP? (How about Gnome2?)

  You mean IIIMF, don't you?  If there's an actual implementation,
I'd love to try it out.  We need a Windows 2k/XP- or MacOS 9/X-style
keyboard/IM switching mechanism/UI so that keyboard/IM modules
targeted at (or customized for) each language can coexist and be
brought up as needed.  IIIMF seems to be the only way, unless
somebody writes a gigantic one-size-fits-all XIM server for the
UTF-8 locale(s).

  How about just running your favorite XIM server under ja_JP.EUC-JP
while all other applications are launched under ja_JP.UTF-8?  As you
know well, this works fine, although the character repertoire you can
enter is limited to that of EUC-JP.  Of course, this is not
full-blown UTF-8 support, but it should at least give you the same
degree of Japanese input under ja_JP.UTF-8 as under ja_JP.EUC-JP.
Well, then you might ask what the point of moving to UTF-8 is.  You
can at least display more characters under UTF-8 than under EUC-JP,
can't you? :-)

  In the Korean case, as I wrote a couple of days ago, I had to
modify Ami (a popular Korean XIM) to make it run under ko_KR.UTF-8,
because otherwise, even though my applications were running under
and fully aware of UTF-8 (e.g. vim under a UTF-8 xterm), I couldn't
enter the more than 8,000 Hangul syllables that are in UTF-8 but not
in EUC-KR.  Moreover, under ko_KR.UTF-8, xterm-16x and Vim 6.1 with
a single-line patch work almost flawlessly with the U+1100 Hangul
Jamo.  Markus, can you update your UTF-8 FAQ on this issue?  Xterm
has supported the Thai script for some time, and that almost
automagically brought in Middle Korean support as a by-product.

  BTW, Xkb may work for Korean Hangul too, and we wouldn't need XIM
if we used a 'three-set keyboard' instead of a 'two-set keyboard'
and could live without Hanja.  I'd have to learn more about Xkb to
be certain, though.

 Or, any software-specific input methods (like Emacs or Yudit)?

  Yudit supports Indic, Thai, and Arabic pretty well as far as I
know.  And, judging from what Gaspar wrote to me, Middle Korean
support with the U+1100 Jamo is not far away.  Most of what's needed
is firmly in place, because Gaspar has written very generic
complex-script support routines that can hopefully be used for
Middle Korean without much effort.

  Jungshik Shin
