Re: GB18030 != CP936 (Alternative project?)

2007-02-27 Thread Yongwei Wu

Hi Tony,

On 2/27/07, A.J.Mechelynck wrote:

> A.J.Mechelynck wrote:
>> Yongwei Wu wrote:
>>> Hi Tony,
>>>
>>> On 2/27/07, A.J.Mechelynck wrote:
>>>> Yongwei Wu wrote:
>>>> [...]
>>>>> If your purpose is only to provide a workaround for
>>>>> LANG=zh_CN.GB18030, changing the environment variable inside
>>>>> main() of Vim may be a better approach.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Yongwei
>>>>
>>>> ... and, if the Chinese messages and menus _actually_ used by
>>>> (g)vim don't use any GB18030 4-byte codepoints, it might even
>>>> work. But only experiment will prove that. If they do use some
>>>> 4-byte codepoints (which are supposed to be rare -- not less
>>>> numerous than 1- and 2-byte codepoints but less commonly used),
>>>> maybe synonyms or periphrases can be devised?
>>>
>>> No, I can guarantee that. I believe only Chinese linguists (or
>>> people that will need to process very strange person names) will
>>> have chance to use Chinese characters that cannot be encoded in
>>> two bytes :-). Other cases that people want to use GB18030 include
>>> using non-Chinese characters/symbols in a GBK-compatible encoding.
>>>
>>> Best regards,
>>>
>>> Yongwei
>>
>> OK, so let me explain what I suggest and you try to poke holes in
>> it. Since Vim doesn't support more-than-2-byte encodings (other
>> than UTF-8 and UCS-32) natively, we cannot set 'encoding' to
>> GB18030. So what shall we do if the locale encoding is set to
>> GB18030 at startup? (Am I correct in assuming that zh_CN.GB18030
>> is the normal locale setting in the PRC nowadays?)
>>
>> Setting $LANG to use GBK instead in main() will mean that any menus
>> and messages, if written without 4-byte codepoints (and menus,
>> AFAIK, do not include proper names or archaic characters) will
>> display correctly.
>>
>> Later (i.e., in the vimrc), and with the proper safeguards, we can do
>>
>> :if &tenc == "" | let &tenc = &enc | endif
>> :set enc=utf-8
>> :set fencs=ucs-bom,utf-8,gb18030,cp1252 " or something similar
>> :setglobal fenc=gb18030
>>
>> and GB18030 files (even with rare proper names or interspersed
>> Cyrillic text) will be read and written correctly (and, IIUC, so
>> will variable parts of messages containing not translations but
>> literals, but only in gvim).
>>
>> I understand that the conversion GB18030 <=> UTF-8 is one-to-one but
>> not necessarily fast, and requires a huge conversion table; but IIUC
>> the iconv library can do it. Apart from this performance question,
>> and from the fact that I deliberately omitted any mention of your
>> new encoding-detection package, do you think the above holds water?
>>
>> Best regards,
>> Tony.
>
> Here is an alternative way to handle it, which may be the right way
> from a conceptual point of view, and in the long term, though it may
> be much more difficult from the coding point of view. It may or may
> not be the right thing to do pragmatically:
>
> Treat GB18030 as what it is, namely, a Unicode Transformation Format.
> In other words, whenever 'encoding' is set to GB18030, use UTF-8
> internally and convert when reading and writing, just like we already
> do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.


I do not think it is worthwhile.  Though GB18030 is an important encoding
(GB2312 and GB18030 are national standards, while the interim GBK is
only a de facto standard owing to Microsoft Windows), I do not suppose
we would ever use characters only in GB18030 (but not in GBK) in menus
and messages.  Edward's patch was a hack to make Vim work well with Red
Hat, and what we need is just such a hack, only to avoid the side-effect
that true GB18030 files cannot be processed in Vim.
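
To make the GBK versus GB18030 distinction concrete, here is a minimal sketch in Python (not Vim code). It assumes that Python's built-in gbk and gb18030 codecs behave like the encodings discussed in this thread. The sample characters and the helper name encoded_length are illustrative choices, not anything taken from Vim's menus or messages.

```python
# Sample characters chosen for illustration only.
SAMPLES = {
    "common hanzi": "中",          # U+4E2D, present in GB2312, GBK and GB18030
    "rare hanzi": "\U00020000",    # U+20000, CJK Extension B, not in GBK
    "Thai letter": "\u0e01",       # U+0E01, a non-Chinese character not in GBK
}


def encoded_length(text, encoding):
    """Return the byte length of text in encoding, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(label,
              "gbk:", encoded_length(ch, "gbk"),
              "gb18030:", encoded_length(ch, "gb18030"))
```

Characters that GBK already covers keep the same two bytes in GB18030, which is why text written for GBK should remain readable as GB18030, while the rare and non-Chinese characters only fit in the four-byte range.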

Best regards,

Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/


Re: GB18030 != CP936 (Alternative project?)

2007-02-27 Thread Bram Moolenaar

Tony Mechelynck wrote:

> Here is an alternative way to handle it, which may be the right way
> from a conceptual point of view, and in the long term, though it may
> be much more difficult from the coding point of view. It may or may
> not be the right thing to do pragmatically:
>
> Treat GB18030 as what it is, namely, a Unicode Transformation Format.
> In other words, whenever 'encoding' is set to GB18030, use UTF-8
> internally and convert when reading and writing, just like we already
> do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.
>
> This, of course, also suffers from the performance problems related to
> conversion GB18030 <=> UTF-8.

Converting various Unicode encodings to and from UTF-8 is trivial.
Conversion between GB18030 and UTF-8 requires iconv.  This is a huge
difference.  Also because the conversion may fail.
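
The difference can be sketched in Python (an illustration only, not how Vim implements it). The UTF-32 to UTF-8 path below is pure bit arithmetic, whereas the GB18030 path has to go through a mapping table; here Python's built-in gb18030 codec stands in for iconv, which is an assumption of this sketch. The function names are hypothetical, and the arithmetic path does no validation of surrogates or out-of-range values.

```python
def utf8_from_code_point(cp):
    """Encode one Unicode code point as UTF-8 using only bit arithmetic."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf32le_to_utf8(data):
    """Convert UTF-32LE bytes to UTF-8 without any lookup table."""
    out = bytearray()
    for i in range(0, len(data), 4):
        out += utf8_from_code_point(int.from_bytes(data[i:i + 4], "little"))
    return bytes(out)


def gb18030_to_utf8(data):
    """Table-driven conversion; returns None if data is not valid GB18030."""
    try:
        return data.decode("gb18030").encode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    text = "中文 ok"
    print(utf32le_to_utf8(text.encode("utf-32-le")) == text.encode("utf-8"))
    print(gb18030_to_utf8(b"\x81\x30\x81"))  # truncated four-byte sequence
```

The None return for the truncated sequence is the kind of failure a caller has to plan for with GB18030 input.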

If we go this way it's probably better not to use tricks and explicitly
set 'encoding' to utf-8.  One would need to try this out to discover any
problems, e.g. with menus.  Try Motif: GTK is utf-8 based thus Motif is
more of a challenge.

-- 
Spam seems to be something useful to novices.  Later you realize that
it's a bunch of indigestible junk that only clogs your system.
Applies to both the food and the e-mail!

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///


Re: GB18030 != CP936 (Alternative project?)

2007-02-26 Thread A.J.Mechelynck

Here is an alternative way to handle it, which may be the right way from a 
conceptual point of view, and in the long term, though it may be much more 
difficult from the coding point of view. It may or may not be the right thing 
to do pragmatically:


Treat GB18030 as what it is, namely, a Unicode Transformation Format. In other 
words, whenever 'encoding' is set to GB18030, use UTF-8 internally and convert 
when reading and writing, just like we already do for UTF-16le, UTF-16be, 
UTF-32le and UTF-32be.
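
A minimal sketch of this "UTF-8 inside, GB18030 at the file boundary" idea, written in Python rather than in Vim's C code. It assumes Python's gb18030 codec is an acceptable stand-in for whatever converter Vim would use; the function names and the temporary file name are hypothetical.

```python
import os
import tempfile


def read_to_internal(path, file_encoding="gb18030"):
    """Read a file and return its content as UTF-8 bytes (the internal form)."""
    with open(path, "rb") as f:
        return f.read().decode(file_encoding).encode("utf-8")


def write_from_internal(path, internal_utf8, file_encoding="gb18030"):
    """Convert the internal UTF-8 bytes back to file_encoding and write them."""
    with open(path, "wb") as f:
        f.write(internal_utf8.decode("utf-8").encode(file_encoding))


def round_trip(original_bytes):
    """Write bytes to a temporary file, read and rewrite it, return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "wb") as f:
            f.write(original_bytes)
        internal = read_to_internal(path)
        write_from_internal(path, internal)
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    data = "中文 \U00020000".encode("gb18030")
    print(round_trip(data) == data)
```

For valid GB18030 input the file comes back byte for byte, because every character has exactly one representation on each side of the conversion.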


This, of course, also suffers from the performance problems related to
conversion GB18030 <=> UTF-8.



Best regards,
Tony.
--
Love and scandal are the best sweeteners of tea.


Re: GB18030 != CP936 (Alternative project?)

2007-02-26 Thread Edward L. Fox

Hi Tony,

On 2/27/07, A.J.Mechelynck wrote:

> [...]
> Here is an alternative way to handle it, which may be the right way
> from a conceptual point of view, and in the long term, though it may
> be much more difficult from the coding point of view. It may or may
> not be the right thing to do pragmatically:
>
> Treat GB18030 as what it is, namely, a Unicode Transformation Format.
> In other words, whenever 'encoding' is set to GB18030, use UTF-8
> internally and convert when reading and writing, just like we already
> do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.


There is still another problem. When using gvim under Windoze with
CP936 locale, we can only set the encoding to CP936. Or the messages
in gvim will become malformed characters. Could anybody offer a good
solution to this problem?
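
The symptom can be reproduced outside Vim with a short Python sketch. It assumes, for illustration, that a translated message arrives as CP936 bytes while the program treats its text as UTF-8; the sample message is hypothetical and is not taken from Vim's message catalogue.

```python
# Hypothetical translated message used only for this demonstration.
message = "文件已保存"

# Assumption: the message reaches the display code as CP936 bytes.
cp936_bytes = message.encode("cp936")

# Interpreting those bytes as UTF-8 without converting them garbles the text.
misread = cp936_bytes.decode("utf-8", errors="replace")

# Converting explicitly from CP936 first preserves the text.
converted = cp936_bytes.decode("cp936")

if __name__ == "__main__":
    print(repr(misread))
    print(converted == message)
```

So the malformed characters come from a mismatch between the encoding of the bytes and the encoding assumed when displaying them, not from the messages themselves.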



Regards,

Edward Leap Fox


Re: GB18030 != CP936 (Alternative project?)

2007-02-26 Thread A.J.Mechelynck

Edward L. Fox wrote:

> There is still another problem. When using gvim under Windoze with
> CP936 locale, we can only set the encoding to CP936. Or the messages
> in gvim will become malformed characters. Could anybody offer a good
> solution to this problem?


Yes, this is a /different/ problem. Gvim does support CP936 natively. But of 
course we may want to handle files containing non-CP936 data and that means 
switching encodings. The command-line messages must of course be output in the 
current 'encoding', whatever it is; but what about the menus? In the gvim 
'encoding' or in the locale encoding? (My memory is hazy on this point.) It 
may be as simple as selecting the right sequence for sourcing the menus, 
changing 'enc', and setting :lang mess to the new encoding, but there may be 
edge cases.
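
For the file-handling half of the problem, the 'fileencodings' list quoted earlier in the thread (ucs-bom,utf-8,gb18030,cp1252) amounts to trying candidate encodings in order. The Python sketch below is only a rough emulation of that idea under stated assumptions: it checks just the UTF-8 BOM, uses Python's codecs instead of iconv, and the function name detect_and_decode is hypothetical. Vim's real detection logic may differ in detail.

```python
import codecs


def detect_and_decode(data, candidates=("utf-8", "gb18030", "cp1252")):
    """Return (encoding_name, text) for the first candidate that decodes data."""
    # Rough stand-in for the "ucs-bom" entry: only the UTF-8 BOM is handled.
    if data.startswith(codecs.BOM_UTF8):
        candidates = ("utf-8-sig",) + tuple(candidates)
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 maps every byte value, so it cannot fail.
    return "latin-1", data.decode("latin-1")


if __name__ == "__main__":
    for raw in ("中文".encode("utf-8"), "中文".encode("gb18030"), b"caf\xe9"):
        print(detect_and_decode(raw))
```

The order matters: a permissive encoding placed too early would accept almost any input, so the strict candidates go first and the 8-bit fallback goes last.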






Best regards,
Tony.
--
I don't know anything about music.  In my line you don't have to.
-- Elvis Presley


Re: GB18030 != CP936 (Alternative project?)

2007-02-26 Thread mbbill
Hello Edward,

Tuesday, February 27, 2007, 11:58:30 AM, you wrote:

> There is still another problem. When using gvim under Windoze with
> CP936 locale, we can only set the encoding to CP936. Or the messages
> in gvim will become malformed characters. Could anybody offer a good
> solution to this problem?


I use these settings:

set encoding=utf-8
set langmenu=zh_CN.utf-8 " this must be set before syntax on
set helplang=cn
language message zh_CN.utf-8




-- 
Best regards,
 mbbill