Noah Spurrier wrote:
I'm trying to edit a utf-8 document in Vim in a vt-102 terminal
(in other words, not Gvim). There are a few multi-byte unicode characters
in the text that behave erratically when I cursor over them.
I don't need these multi-byte characters, so I'm happy to just
delete them rather than try to edit them in gvim.

How can I search for and delete utf-8 multi-byte character sequences?

I have both two-byte and three-byte utf-8 characters to consider.
I'm using vim7.

Yours,
Noah





1. You must have 'encoding' set to UTF-8. Note that this will change three things: (a) how Vim understands the contents of the file, (b) how it displays it, (c) how it understands your keypresses. (c) is also governed by 'termencoding', but in Console Vim (not gvim) this governs (b) too. If 'encoding' is not set to UTF-8 when entering Vim (and before any vimrc commands), you may need to set 'termencoding' to the "old" value of 'encoding' before changing 'encoding'. But in console Vim the display (which is controlled by the terminal, not directly by Vim) may or may not be garbled. Using gvim is easier -- unless you are on a Unix/Linux system with no X server running.

2. To delete a composing character while leaving the underlying spacing character unchanged, make sure 'delcombine' is TRUE and use backspace in Insert mode. This deletes one combining character at a time; in some cases, the same character may be "combining" or "spacing" depending on context: e.g., Vim (with +arabic, and 'arabic' set) treats Arabic alif as "combining" when it immediately follows laam, as "spacing" in other cases.

3. To delete the spacing character together with any combining characters it may have, use (for instance) x in Normal mode.

4. Multi-byte characters in UTF-8 are anything greater than 127. You can search for codepoints between U+0080 and U+00FF by searching on

        /[\x80-\xFF]

IIUC, there's no way to search for a _range_ of characters higher than U+00FF. Or maybe you can "trick" the system by searching on

        /[^\x00-\x7F]

but I haven't tested it.
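
Either search only behaves as described once the settings from point 1 are in place. A minimal vimrc sketch of that point, assuming Vim starts up with a non-UTF-8 'encoding' and an empty 'termencoding' (adapt it to your own setup):

```vim
" Remember the encoding the terminal is really using, but only if
" 'termencoding' has not been set explicitly already.
if &termencoding == ''
  let &termencoding = &encoding
endif

" Now switch Vim's internal encoding to UTF-8.
set encoding=utf-8
```

If the file was already loaded before these settings took effect, rereading it with ":e ++enc=utf-8" should make Vim interpret its bytes as UTF-8.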

WARNING: Any Latin accented character, even à é ç etc., is multi-byte in UTF-8, and so are symbols such as the degree sign. So if you want to do the deletion with the ":s[ubstitute]" command, you had better use the c (confirm) flag so that you can review each match before it is removed.
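
A sketch of such a substitution, built on the negated range above and assuming 'encoding' is already utf-8. Like the search itself, treat it as something to try on a copy of the file first:

```vim
" Delete every character outside the 7-bit ASCII range, over the whole
" buffer (%), for every match on a line (g), asking for confirmation
" at each match (c).
:%s/[^\x00-\x7F]//gc
```

At each prompt, y deletes the match, n skips it (useful for accented letters you want to keep), a deletes all remaining matches, and q stops.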


Best regards,
Tony.
