Re: GB18030 != CP936 (Alternative project?)
Hi Tony,

On 2/27/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:
> A.J.Mechelynck wrote:
>> Yongwei Wu wrote:
>>> Hi Tony,
>>>
>>> On 2/27/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:
>>>> Yongwei Wu wrote:
>>>>> [...]
>>>>> If your purpose is only to provide a workaround for LANG=zh_CN.GB18030, changing the environment variable inside main() of Vim may be a better approach.
>>>>>
>>>>> Best regards,
>>>>> Yongwei
>>>>
>>>> ... and, if the Chinese messages and menus _actually_ used by (g)vim don't use any GB18030 4-byte codepoints, it might even work. But only experiment will prove that. If they do use some 4-byte codepoints (which are supposed to be rare -- not less numerous than 1- and 2-byte codepoints but less commonly used), maybe synonyms or periphrases can be devised?
>>>
>>> No, they don't; I can guarantee that. I believe only Chinese linguists (or people who need to process very strange personal names) will have occasion to use Chinese characters that cannot be encoded in two bytes :-). Other reasons people want GB18030 include using non-Chinese characters/symbols in a GBK-compatible encoding.
>>>
>>> Best regards,
>>> Yongwei
>>
>> OK, so let me explain what I suggest and you try to poke holes in it.
>>
>> Since Vim doesn't support more-than-2-byte encodings (other than UTF-8 and UCS-4) natively, we cannot set 'encoding' to GB18030. So what shall we do if the locale encoding is set to GB18030 at startup? (Am I correct in assuming that zh_CN.GB18030 is the normal locale setting in the PRC nowadays?)
>>
>> Setting $LANG to use GBK instead in main() will mean that any menus and messages, if written without 4-byte codepoints (and menus, AFAIK, do not include proper names or archaic characters), will display correctly.
>>
>> Later (i.e., in the vimrc), and with the proper safeguards, we can do
>>
>>     :if &tenc == "" | let &tenc = &enc | endif
>>     :set enc=utf-8
>>     :set fencs=ucs-bom,utf-8,gb18030,cp1252   " or something similar
>>     :setglobal fenc=gb18030
>>
>> and GB18030 files (even with rare proper names or interspersed Cyrillic text) will be read and written correctly (and, IIUC, so will variable parts of messages containing not translations but literals, but only in gvim).
>>
>> I understand that the conversion GB18030 <=> UTF-8 is one-to-one but not necessarily fast, and requires a huge conversion table; but IIUC the iconv library can do it.
>>
>> Apart from this performance question, and from the fact that I deliberately omitted any mention of your new encoding-detection package, do you think the above holds water?
>>
>> Best regards,
>> Tony.
>
> Here is an alternative way to handle it, which may be the right way from a conceptual point of view, and in the long term, though it may be much more difficult from the coding point of view. It may or may not be the right thing to do pragmatically:
>
> Treat GB18030 as what it is, namely, a Unicode Transformation Format. In other words, whenever 'encoding' is set to GB18030, use UTF-8 internally and convert when reading and writing, just like we already do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.

I do not think it is worthwhile. Though GB18030 is an important encoding (GB2312 and GB18030 are national standards, while the interim GBK is only a de facto standard owing to Microsoft Windows), I do not suppose we would ever use characters that exist only in GB18030 (and not in GBK) in menus and messages. Edward's patch was a hack to make Vim work well with Red Hat, and what we need is just such a hack, only one that avoids the side effect that true GB18030 files cannot be processed in Vim.

Best regards,
Yongwei

--
Wu Yongwei
URL: http://wyw.dcweb.cn/
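To make the two-byte versus four-byte distinction discussed above concrete, here is a small sketch. It uses Python's built-in "gbk" and "gb18030" codecs purely as a stand-in for whatever converter Vim would use; the helper name encoded_length and the two sample characters are illustrative choices, not anything from Vim.

```python
# Illustrative only: Python's "gbk" and "gb18030" codecs stand in for the
# converter (e.g. iconv) that Vim would rely on.

def encoded_length(ch: str, encoding: str):
    """Return how many bytes `ch` needs in `encoding`, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


common = "\u4e2d"      # an everyday Chinese character
rare = "\U00020000"    # a CJK Extension B character, outside GBK

for ch in (common, rare):
    print(f"U+{ord(ch):04X}: gbk={encoded_length(ch, 'gbk')}, "
          f"gb18030={encoded_length(ch, 'gb18030')}")

# For the everyday character the two encodings produce identical bytes,
# which is why a GBK fallback is harmless for ordinary menu text.
assert common.encode("gbk") == common.encode("gb18030")
```

The everyday character takes two bytes in both encodings, while the rare one cannot be represented in GBK at all and needs four bytes in GB18030.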
Re: GB18030 != CP936 (Alternative project?)
Tony Mechelynck wrote:

> Here is an alternative way to handle it, which may be the right way from a conceptual point of view, and in the long term, though it may be much more difficult from the coding point of view. It may or may not be the right thing to do pragmatically:
>
> Treat GB18030 as what it is, namely, a Unicode Transformation Format. In other words, whenever 'encoding' is set to GB18030, use UTF-8 internally and convert when reading and writing, just like we already do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.
>
> This, of course, also suffers from the performance problems related to conversion GB18030 <=> UTF-8.

Converting various Unicode encodings to and from UTF-8 is trivial. Conversion between GB18030 and UTF-8 requires iconv. This is a huge difference, also because the conversion may fail.

If we go this way it's probably better not to use tricks and to set 'encoding' to utf-8 explicitly. One would need to try this out to discover any problems, e.g. with menus. Try Motif: GTK is UTF-8 based, so Motif is more of a challenge.

--
Spam seems to be something useful to novices. Later you realize that it's a bunch of indigestible junk that only clogs your system. Applies to both the food and the e-mail!

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ download, build and distribute -- http://www.A-A-P.org ///
 \\\ help me help AIDS victims -- http://ICCF-Holland.org ///
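As a rough illustration of the remark that the conversion may fail, the sketch below converts GB18030 bytes to UTF-8 and reports malformed input instead of silently continuing. Python's codec is used here in place of iconv, and the function names gb18030_to_utf8 and try_convert are made up for this example.

```python
def gb18030_to_utf8(data: bytes) -> bytes:
    """Convert GB18030 bytes to UTF-8; raises UnicodeDecodeError on malformed input."""
    return data.decode("gb18030").encode("utf-8")


def try_convert(data: bytes):
    """Return (utf8_bytes, None) on success, or (None, error_message) on failure."""
    try:
        return gb18030_to_utf8(data), None
    except UnicodeDecodeError as exc:
        return None, f"cannot convert at byte offset {exc.start}: {exc.reason}"


# One 2-byte character followed by one 4-byte character.
good = "\u4e2d\U00020000".encode("gb18030")
# The same data with the final byte missing, e.g. a file cut off mid-character.
truncated = good[:-1]

print(try_convert(good))
print(try_convert(truncated))
```

A caller has to decide what to do in the failure case (refuse to load, fall back to another encoding, or keep the raw bytes), which is a decision the trivial UTF-16/UTF-32 conversions never force for well-formed input.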
Re: GB18030 != CP936 (Alternative project?)
A.J.Mechelynck wrote:
> Yongwei Wu wrote:
>> Hi Tony,
>>
>> On 2/27/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:
>>> Yongwei Wu wrote:
>>>> [...]
>>>> If your purpose is only to provide a workaround for LANG=zh_CN.GB18030, changing the environment variable inside main() of Vim may be a better approach.
>>>>
>>>> Best regards,
>>>> Yongwei
>>>
>>> ... and, if the Chinese messages and menus _actually_ used by (g)vim don't use any GB18030 4-byte codepoints, it might even work. But only experiment will prove that. If they do use some 4-byte codepoints (which are supposed to be rare -- not less numerous than 1- and 2-byte codepoints but less commonly used), maybe synonyms or periphrases can be devised?
>>
>> No, they don't; I can guarantee that. I believe only Chinese linguists (or people who need to process very strange personal names) will have occasion to use Chinese characters that cannot be encoded in two bytes :-). Other reasons people want GB18030 include using non-Chinese characters/symbols in a GBK-compatible encoding.
>>
>> Best regards,
>> Yongwei
>
> OK, so let me explain what I suggest and you try to poke holes in it.
>
> Since Vim doesn't support more-than-2-byte encodings (other than UTF-8 and UCS-4) natively, we cannot set 'encoding' to GB18030. So what shall we do if the locale encoding is set to GB18030 at startup? (Am I correct in assuming that zh_CN.GB18030 is the normal locale setting in the PRC nowadays?)
>
> Setting $LANG to use GBK instead in main() will mean that any menus and messages, if written without 4-byte codepoints (and menus, AFAIK, do not include proper names or archaic characters), will display correctly.
>
> Later (i.e., in the vimrc), and with the proper safeguards, we can do
>
>     :if &tenc == "" | let &tenc = &enc | endif
>     :set enc=utf-8
>     :set fencs=ucs-bom,utf-8,gb18030,cp1252   " or something similar
>     :setglobal fenc=gb18030
>
> and GB18030 files (even with rare proper names or interspersed Cyrillic text) will be read and written correctly (and, IIUC, so will variable parts of messages containing not translations but literals, but only in gvim).
>
> I understand that the conversion GB18030 <=> UTF-8 is one-to-one but not necessarily fast, and requires a huge conversion table; but IIUC the iconv library can do it.
>
> Apart from this performance question, and from the fact that I deliberately omitted any mention of your new encoding-detection package, do you think the above holds water?
>
> Best regards,
> Tony.

Here is an alternative way to handle it, which may be the right way from a conceptual point of view, and in the long term, though it may be much more difficult from the coding point of view. It may or may not be the right thing to do pragmatically:

Treat GB18030 as what it is, namely, a Unicode Transformation Format. In other words, whenever 'encoding' is set to GB18030, use UTF-8 internally and convert when reading and writing, just like we already do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.

This, of course, also suffers from the performance problems related to conversion GB18030 <=> UTF-8.

Best regards,
Tony.

--
Love and scandal are the best sweeteners of tea.
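The 'fileencodings' list proposed above relies on trying candidates in order and keeping the first one that converts without error. The following is only a simplified model of that idea, not Vim's actual implementation; the function name detect_and_decode and the use of Python codecs are assumptions made for the sketch.

```python
import codecs

# Byte-order marks are checked first, standing in for the "ucs-bom" entry.
# UTF-32 must come before UTF-16 because BOM_UTF32_LE starts with BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Remaining candidates, strictest first; the 8-bit fallback goes last
# because it accepts almost any byte sequence.
FALLBACKS = ["utf-8", "gb18030", "cp1252"]


def detect_and_decode(data: bytes):
    """Return (encoding_name, text) for the first candidate that decodes cleanly."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):].decode(name)
    for name in FALLBACKS:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding decoded the data")


print(detect_and_decode("\u4e2d\u6587".encode("utf-8")))
print(detect_and_decode("\u4e2d\u6587".encode("gb18030")))
print(detect_and_decode(b"caf\xe9"))
```

The ordering matters: UTF-8 validation is strict, so genuine GB18030 text is very unlikely to pass it by accident, whereas putting cp1252 earlier would let it claim nearly every file.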
Re: GB18030 != CP936 (Alternative project?)
Hi Tony,

On 2/27/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:
> [...]
> Here is an alternative way to handle it, which may be the right way from a conceptual point of view, and in the long term, though it may be much more difficult from the coding point of view. It may or may not be the right thing to do pragmatically:
>
> Treat GB18030 as what it is, namely, a Unicode Transformation Format. In other words, whenever 'encoding' is set to GB18030, use UTF-8 internally and convert when reading and writing, just like we already do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.

There is still another problem. When using gvim under Windoze with a CP936 locale, we can only set 'encoding' to CP936; otherwise the messages in gvim turn into malformed characters. Could anybody offer a good solution to this problem?

> This, of course, also suffers from the performance problems related to conversion GB18030 <=> UTF-8.
>
> Best regards,
> Tony.
> --
> Love and scandal are the best sweeteners of tea.

Regards,

Edward Leap Fox
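One plausible mechanism behind the malformed messages described above is that message text stored as CP936 bytes gets interpreted as UTF-8 once 'encoding' changes. The sketch below only demonstrates that general effect with Python codecs; it does not reproduce what gvim does internally, and the function names are hypothetical.

```python
def misread_as_utf8(text: str, actual: str = "cp936") -> str:
    """Encode `text` in `actual`, then wrongly interpret those bytes as UTF-8."""
    return text.encode(actual).decode("utf-8", errors="replace")


def convert_to_utf8(text: str, actual: str = "cp936") -> str:
    """Encode `text` in `actual`, then convert properly to UTF-8 and read it back."""
    raw = text.encode(actual)
    return raw.decode(actual).encode("utf-8").decode("utf-8")


message = "\u6587\u4ef6"   # a two-character word such as a "File" menu label

print(repr(misread_as_utf8(message)))       # contains U+FFFD replacement characters
print(convert_to_utf8(message) == message)  # explicit conversion keeps the text intact
```

The fix in either direction is the same: something has to convert the message bytes explicitly whenever the display encoding differs from the encoding the messages were written in.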
Re: GB18030 != CP936 (Alternative project?)
Edward L. Fox wrote:
> Hi Tony,
>
> On 2/27/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:
>> [...]
>> Here is an alternative way to handle it, which may be the right way from a conceptual point of view, and in the long term, though it may be much more difficult from the coding point of view. It may or may not be the right thing to do pragmatically:
>>
>> Treat GB18030 as what it is, namely, a Unicode Transformation Format. In other words, whenever 'encoding' is set to GB18030, use UTF-8 internally and convert when reading and writing, just like we already do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.
>
> There is still another problem. When using gvim under Windoze with a CP936 locale, we can only set 'encoding' to CP936; otherwise the messages in gvim turn into malformed characters. Could anybody offer a good solution to this problem?

Yes, this is a /different/ problem. Gvim does support CP936 natively. But of course we may want to handle files containing non-CP936 data, and that means switching encodings. The command-line messages must of course be output in the current 'encoding', whatever it is; but what about the menus? Are they in the gvim 'encoding' or in the locale encoding? (My memory is hazy on this point.) It may be as simple as selecting the right sequence for sourcing the menus, changing 'enc', and setting :lang mess to the new encoding, but there may be edge cases.

>> This, of course, also suffers from the performance problems related to conversion GB18030 <=> UTF-8.
>>
>> Best regards,
>> Tony.
>> --
>> Love and scandal are the best sweeteners of tea.
>
> Regards,
>
> Edward Leap Fox

Best regards,
Tony.

--
I don't know anything about music. In my line you don't have to.
		-- Elvis Presley
Re: GB18030 != CP936 (Alternative project?)
Hello Edward,

Tuesday, February 27, 2007, 11:58:30 AM, you wrote:

> Hi Tony,
>
> On 2/27/07, A.J.Mechelynck [EMAIL PROTECTED] wrote:
>> [...]
>> Here is an alternative way to handle it, which may be the right way from a conceptual point of view, and in the long term, though it may be much more difficult from the coding point of view. It may or may not be the right thing to do pragmatically:
>>
>> Treat GB18030 as what it is, namely, a Unicode Transformation Format. In other words, whenever 'encoding' is set to GB18030, use UTF-8 internally and convert when reading and writing, just like we already do for UTF-16le, UTF-16be, UTF-32le and UTF-32be.
>
> There is still another problem. When using gvim under Windoze with CP936 locale, we can only set the encoding to CP936. Or the messages in gvim will become malformed characters. Could anybody offer a good solution to this problem?
>
>> This, of course, also suffers from the performance problems related to conversion GB18030 <=> UTF-8.
>>
>> Best regards,
>> Tony.
>> --
>> Love and scandal are the best sweeteners of tea.
>
> Regards,
>
> Edward Leap Fox

I use these settings:

    set encoding=utf-8
    set langmenu=zh_CN.utf-8    " this must be set before syntax on
    set helplang=cn
    language message zh_CN.utf-8

--
Best regards,
mbbill  mailto:[EMAIL PROTECTED]