using regexp to search for Unicode code points and properties

Brian Anderson Thu, 20 Aug 2009 07:28:41 -0700

I'm interested in learning how to use regular expressions in Vi(m) to 
search for Unicode code points.


A book about regexp describes how to search for Unicode code points by 
various means and in various programming languages.

The book describes searching for a specific Unicode code point as \u2122 
or \x{2122}.
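
To make the notation concrete outside of Vim, here is a minimal Python 
sketch of matching one specific code point. It assumes Python's standard 
re module, which accepts the \u2122 form in a pattern but not the 
\x{2122} form. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text containing U+2122 TRADE MARK SIGN.
text = "Vim\u2122 editor"

# Raw string: the regex engine itself interprets \u2122 (exactly 4 hex digits).
pattern = re.compile(r"\u2122")

match = pattern.search(text)
# Index of the first occurrence, or -1 if the code point is absent.
found_at = match.start() if match else -1
print(found_at)
```

Other regexp flavors spell the same idea differently, so this shows only 
what the search is meant to do, not the syntax Vim expects.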

From what I've seen in the Vim help files, \u matches uppercase 
characters, not Unicode code points, and \x matches hexadecimal digits.

The book also talks about using Unicode properties or categories in the 
search. It indicates there are 30 Unicode categories, grouped into 
7 super-categories.
For example, \p{Ll} would find any lowercase letter that has an 
uppercase variant, and \p{Lo} any letter or ideograph that does not have 
lowercase and uppercase variants.
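
As a rough illustration of what a category test does, here is a small 
Python sketch. Python's standard re module does not understand \p{...}, 
so this assumes the unicodedata module instead, which reports the 
two-letter general category of a character. The helper name 
chars_in_category and the sample string are hypothetical.

```python
import unicodedata

def chars_in_category(text, category):
    """Return the characters of text whose general category equals category."""
    return [ch for ch in text if unicodedata.category(ch) == category]

# Sample: a, B, e with acute accent, a CJK ideograph, the digit 1.
sample = "aB\u00e9\u4e2d1"

lowercase = chars_in_category(sample, "Ll")      # lowercase letters
other_letters = chars_in_category(sample, "Lo")  # letters without case
print(lowercase, other_letters)
```

This filters characters one at a time rather than searching with a 
pattern, so it only shows which characters each category would select.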

Unicode blocks are written as \p{IsGreekExtended}. A block consists of a 
single range of code points. Example: any code point between U+0000 and 
U+007F can be matched with \p{InBasicLatin}.
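
Because a block is one contiguous range, a flavor without block names 
can approximate it with a plain character class. A minimal Python sketch, 
assuming the standard re module and a made-up sample string:

```python
import re

# Basic Latin written as an explicit code point range, U+0000 to U+007F.
basic_latin = re.compile(r"[\u0000-\u007F]+")

# Sample: Latin text, then Greek alpha beta gamma, then Latin text again.
sample = "abc \u03b1\u03b2\u03b3 xyz"

# Each run of consecutive Basic Latin characters, spaces included.
runs = basic_latin.findall(sample)
print(runs)
```

The Greek letters fall outside the range, so they split the sample into 
two runs.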

A Unicode script is written as \p{Greek}. Each Unicode code point is part 
of only one Unicode script. So if I wanted to search for any Greek 
letter, I'd use \p{Greek}.

A Unicode grapheme is \X or \P{M}\p{M}*. This would be either a single 
code point (U+00E0 Latin small letter a with grave accent) or a combined 
sequence (U+0061 Latin small letter a + U+0300 combining grave accent).
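
The two spellings of the same visible character can be compared with a 
short Python sketch. It assumes the standard unicodedata module. The 
graphemes helper is hypothetical and only a rough approximation: it 
attaches each combining mark (any category starting with M) to the code 
point before it, which is not the full Unicode segmentation algorithm.

```python
import unicodedata

composed = "\u00e0"     # one code point: a with grave accent
decomposed = "a\u0300"  # two code points: a + combining grave accent

# The two forms differ as strings but are equal after NFC normalization.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

def graphemes(text):
    """Group each base code point with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # mark: attach to the previous cluster
        else:
            clusters.append(ch)  # non-mark: start a new cluster
    return clusters

print(same_after_nfc, len(graphemes(composed)), len(graphemes(decomposed)))
```

Both forms come out as a single cluster, which is the behavior a 
grapheme-aware search is after.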

Help with any of these, either examples or pointers to where to look in 
the help files, would be welcome.

Thanks.

Brian

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_use" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
