Re: GB18030 != CP936 (Alternative project?)

Yongwei Wu Tue, 27 Feb 2007 01:19:29 -0800

Hi Tony,

On 2/27/07, A.J.Mechelynck <[EMAIL PROTECTED]> wrote:

A.J.Mechelynck wrote:
> Yongwei Wu wrote:
>> Hi Tony,
>>
>> On 2/27/07, A.J.Mechelynck <[EMAIL PROTECTED]> wrote:
>>> Yongwei Wu wrote:
>>> [...]
>>> > If your purpose is only to provide a workaround for
>>> > LANG=zh_CN.GB18030, changing the environment variable inside main() of
>>> > Vim may be a better approach.
>>> >
>>> > Best regards,
>>> >
>>> > Yongwei
>>> >
>>>
>>> ... and, if the Chinese messages and menus _actually_ used by (g)vim
>>> don't use any GB18030 4-byte codepoints, it might even work. But only
>>> experiment will prove that. If they do use some 4-byte codepoints
>>> (which are supposed to be rare -- not less numerous than 1- and 2-byte
>>> codepoints but less commonly used), maybe synonyms or periphrases can
>>> be devised?
>>
>> No, I can guarantee that. I believe only Chinese linguists (or people
>> who need to process very unusual personal names) will have occasion
>> to use Chinese characters that cannot be encoded in two bytes :-).
>> Other reasons people want GB18030 include using non-Chinese
>> characters/symbols in a GBK-compatible encoding.
>>
>> Best regards,
>>
>> Yongwei
>>
>
> OK, so let me explain what I suggest and you try to poke holes in it.
> Since Vim doesn't support more-than-2-byte encodings (other than UTF-8
> and UCS-4) natively, we cannot set 'encoding' to GB18030. So what shall
> we do if the "locale" encoding is set to GB18030 at startup? (Am I
> correct in assuming that zh_CN.GB18030 is the "normal" locale setting in
> the PRC nowadays?)
>
> Setting $LANG to use GBK instead in main() will mean that any menus and
> messages, if written without 4-byte codepoints (and menus, AFAIK, do not
> include proper names or archaic characters) will display correctly.
>
> Later (i.e., in the vimrc), and with the proper safeguards, we can do
>
>     :if &tenc == "" | let &tenc = &enc | endif
>     :set enc=utf-8
>     :set fencs=ucs-bom,utf-8,gb18030,cp1252 " or something similar
>     :setglobal fenc=gb18030
>
> and GB18030 files (even with "rare" proper names or interspersed
> Cyrillic text) will be read and written correctly (and, IIUC, so will
> "variable" parts of messages containing not translations but literals,
> but only in gvim).
>
> I understand that the conversion GB18030 <=> UTF-8 is one-to-one but not
> necessarily fast, and requires a huge conversion table; but IIUC the
> iconv library can do it. Apart from this performance question, and from
> the fact that I deliberately omitted any mention of your new
> encoding-detection package, do you think the above holds water?
>
>
> Best regards,
> Tony.


> Here is an alternative way to handle it, which may be "the right way"
> from a conceptual point of view, and in the long term, though it may be
> much more difficult from the coding point of view. It may or may not be
> "the right thing to do" pragmatically:
>
> Treat GB18030 as what it is, namely, a Unicode Transformation Format. In
> other words, whenever 'encoding' is set to GB18030, use UTF-8 internally
> and convert when reading and writing, just like we already do for
> UTF-16le, UTF-16be, UTF-32le and UTF-32be.
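
For illustration only, the "use UTF-8 internally and convert when reading
and writing" idea can be sketched in a few lines of Python. This is not Vim
code: Python's built-in "gb18030" codec stands in for whatever conversion
(e.g. iconv) Vim would use, and the function names are made up for the
example.

```python
# Sketch only: Python's built-in codecs stand in for Vim's conversion layer.
# to_internal, to_file and gb18030_len are hypothetical names for this example.

def to_internal(raw: bytes) -> bytes:
    """Convert GB18030 file bytes to UTF-8, the assumed internal form."""
    return raw.decode("gb18030").encode("utf-8")


def to_file(internal: bytes) -> bytes:
    """Convert internal UTF-8 bytes back to GB18030 for writing."""
    return internal.decode("utf-8").encode("gb18030")


def gb18030_len(ch: str) -> int:
    """Number of bytes one character occupies in GB18030."""
    return len(ch.encode("gb18030"))


if __name__ == "__main__":
    # ASCII, a common Han character, a Cyrillic letter, and U+20000,
    # a rare Han character outside the Basic Multilingual Plane.
    for ch in ("A", "\u4e2d", "\u0416", "\U00020000"):
        print(f"U+{ord(ch):04X}: {gb18030_len(ch)} byte(s) in GB18030")
```

The round trip is lossless for any valid GB18030 input, including the
4-byte codepoints that a 2-byte 'encoding' such as GBK cannot represent.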


I do not think it is worthwhile.  Though GB18030 is an important encoding
(GB2312 and GB18030 are national standards, while the interim GBK is
only a de facto standard owing to Microsoft Windows), I do not expect
menus and messages ever to use characters that exist in GB18030 but not
in GBK.  Edward's patch was a hack to make Vim work well with Red Hat,
and what we need is just such a hack, but one without the side effect
that true GB18030 files cannot be processed in Vim.
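
A rough sketch of the kind of hack meant here, written in Python rather
than in Vim's C source purely for illustration. The function names are
hypothetical, and the assumption that only the codeset part of $LANG needs
rewriting is mine, not something taken from Edward's patch.

```python
# Sketch only: the logic of the proposed workaround, not actual Vim code.
# downgrade_locale and fits_in_gbk are hypothetical names for this example.

def downgrade_locale(lang: str) -> str:
    """Replace a GB18030 codeset in a locale string with GBK.

    Any other locale string is returned unchanged.
    """
    head, at, modifier = lang.partition("@")
    base, dot, codeset = head.partition(".")
    if dot and codeset.upper().replace("-", "") == "GB18030":
        return base + ".GBK" + at + modifier
    return lang


def fits_in_gbk(message: str) -> bool:
    """Return True if the message needs no GB18030-only characters."""
    try:
        message.encode("gbk")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(downgrade_locale("zh_CN.GB18030"))
    print(fits_in_gbk("\u6587\u4ef6"))
```

Only the locale used for menus and messages is changed in this sketch;
files would still be detected and converted as GB18030 through
'fileencodings', so true GB18030 files remain usable.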

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/
