Re: How to delete unicode multi-byte characters?

A.J.Mechelynck Tue, 08 Aug 2006 22:08:38 -0700

Noah Spurrier wrote:

I'm trying to edit a utf-8 document in Vim in a vt-102 terminal
(in other words, not Gvim). There are a few multi-byte unicode characters
in the text that behave erratically when I cursor over them.
I don't need these multi-byte characters, so I'm happy to just
delete them rather than try to edit them in gvim.


How can I search for and delete utf-8 multi-byte character sequences?

I have both two-byte and three-byte utf-8 characters to consider.
I'm using vim7.

Yours,
Noah

1. You must have 'encoding' set to UTF-8. Note that this will change
three things: (a) how Vim understands the contents of the file, (b) how
it displays it, (c) how it understands your keypresses. (c) is also
governed by 'termencoding' but in Console Vim (not gvim) this governs
(b) too. If 'encoding' is not set to UTF-8 when entering Vim (and before
any vimrc commands) you may need to set 'termencoding' to the "old"
value of 'encoding' before changing 'encoding'. But in console Vim the
display (which is controlled by the terminal, not directly by Vim) may
or may not be garbled. Using gvim is easier -- unless you are on a
Unix/Linux system with no X server running.

2. To delete a composing character while leaving the underlying spacing
character unchanged, make sure 'delcombine' is TRUE and use backspace in
Insert mode. This deletes one combining character at a time; in some
cases, the same character may be "combining" or "spacing" depending on
context: e.g., Vim (with +arabic, and 'arabic' set) treats Arabic alif
as "combining" when it immediately follows laam, as "spacing" in other
cases.

3. To delete the spacing character together with any combining
characters it may have, use (for instance) x in Normal mode, with
'delcombine' off (when it is on, x also removes combining characters
one at a time).

4. Multi-byte characters in UTF-8 are anything greater than 127. You can
search for codepoints between U+0080 and U+00FF by searching on


        /[\x80-\xFF]

IIUC, there's no way to search for a _range_ of characters higher than
U+00FF. Or maybe you can "trick" the system by searching on


        /[^\x00-\x7F]

but I haven't tested it.
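
For what it's worth, here is a rough sketch of how the second search might be tried out and each hit checked before anything is deleted. It assumes 'encoding' is already utf-8; the ga and g8 Normal-mode commands are not mentioned above and are only suggested here as a way to see what is under the cursor.

```vim
" Sketch only, assuming 'encoding' is already utf-8.
" Highlight every match so stray characters are easy to spot.
:set hlsearch

" Jump to the next character that is not 7-bit ASCII
" (the negated-range trick from above).
/[^\x00-\x7F]

" With the cursor on a hit, in Normal mode:
"   ga   shows the codepoint of the character under the cursor
"   g8   shows the bytes of its UTF-8 encoding
" Use n / N to move to the next / previous hit.
```

The negated form keeps the range itself within 7-bit ASCII, which is why it may work even where a range above U+00FF does not.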

WARNING: Any Latin accented characters, even à é ç etc., are multi-byte
in UTF-8. Also the degree sign etc. So I guess if you want to do it
using the ":s[ubstitute]" command you had better set the c flag.
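
Putting points 1 and 2 and the warning together, a minimal sketch might look like the following. The first part is meant for the vimrc (so it runs before any file is loaded) and is only one possible way of doing the 'termencoding' step described in point 1; the substitute command reuses the untested negated-range pattern from above, so treat it as an assumption rather than a verified recipe.

```vim
" --- vimrc part (point 1) ---
" If Vim did not start in UTF-8, remember the old value for the
" terminal side before switching 'encoding'.
if &encoding !=? 'utf-8'
  if &termencoding == ''
    let &termencoding = &encoding
  endif
  set encoding=utf-8
endif

" --- optional (point 2) ---
" Let backspace in Insert mode remove one combining character at a time.
set delcombine

" --- typed by hand once the file is open ---
" Delete non-ASCII characters in the whole buffer, asking at each
" match because of the c flag. At the prompt:
"   y = delete this one, n = skip it, a = all remaining, q = quit
:%s/[^\x00-\x7F]//gc
```

Answering n at the prompt keeps accented letters such as à é ç or the degree sign, which would otherwise be removed along with everything else.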



Best regards,
Tony.
