On 07/23/2014 03:23 PM, Eric Muller wrote:
> I would like to work with the exemplarCharacters data in the CLDR.
> That uses the UnicodeSet notation. Is there somewhere a parser for
> that notation, that would return me just the list of characters in the
> set? Something a bit like the UnicodeSet utility at
> <http://unicode.org/cldr/utility/list-unicodeset.jsp>, but for use in
> apps/shell.
>
> I suspect that the exemplarCharacters use a restricted form of the
> UnicodeSet notation (e.g. do not use property values). Is that
> correct, and if so, what's the subset?
>
> Incidentally, I copy/pasted the punctuation exemplar characters for
> he.xml into the utility, and it reported that the set contains 8,130
> code points, including the ascii letters. Somehow, that seems
> incorrect. What did I do wrong?
>
> Thanks,
> Eric.
>
Eric,

UnicodeSet is a class available in ICU4J and ICU4C/C++, so you can parse
and query sets using the ICU API.

I wrote a little command line utility, badly named "ucd", that is
similar to the web page mentioned above. It is here:
http://source.icu-project.org/repos/icu/icuapps/trunk/ucd/
and here is the readme:
http://source.icu-project.org/repos/icu/icuapps/trunk/ucd/readme.txt

Let me know what platform you are on and I can send you build
instructions.

-s

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl
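
To illustrate the "pattern in, list of characters out" idea from the question, here is a minimal Python sketch. It is a hand-rolled stand-in, not ICU's UnicodeSet parser, and the function name `parse_simple_unicodeset` is hypothetical. It assumes a restricted pattern form: literal characters, `a-z` ranges, backslash escapes (`\uXXXX`, `\UXXXXXXXX`, or a backslash before a literal character), and `{...}` multi-character strings. Whether CLDR's exemplarCharacters stay within that subset is an assumption, not something confirmed in this thread. Property expressions, nested sets, negation, and set operators are rejected rather than guessed at; for those, the ICU class is the tool to use.

```python
def parse_simple_unicodeset(pattern):
    """Expand a restricted UnicodeSet-style pattern into a list of elements.

    Sketch only: supports literals, ranges, backslash escapes and {strings}.
    Returns single characters sorted by code point, then multi-char strings.
    """
    text = pattern.strip()
    if len(text) < 2 or text[0] != "[" or text[-1] != "]":
        raise ValueError("expected a pattern of the form [...]")
    body = text[1:-1]
    n = len(body)

    def read_char(i):
        """Read one (possibly escaped) character; return (char, next index)."""
        c = body[i]
        if c != "\\":
            return c, i + 1
        if i + 1 >= n:
            raise ValueError("dangling backslash")
        esc = body[i + 1]
        if esc == "u":
            return chr(int(body[i + 2:i + 6], 16)), i + 6
        if esc == "U":
            return chr(int(body[i + 2:i + 10], 16)), i + 10
        return esc, i + 2  # escaped literal such as \- or \[

    chars, strings = set(), set()
    prev = None  # last single character seen, i.e. a possible range start
    i = 0
    while i < n:
        c = body[i]
        if c.isspace():
            i += 1  # unescaped whitespace is ignored
        elif c in "[&^":
            raise ValueError("nested sets, negation and operators are not supported")
        elif c == "{":
            i += 1
            buf = []
            while i < n and body[i] != "}":
                if body[i].isspace():
                    i += 1
                    continue
                ch, i = read_char(i)
                buf.append(ch)
            if i >= n:
                raise ValueError("unterminated {")
            i += 1  # skip the closing brace
            strings.add("".join(buf))
            prev = None
        elif c == "-" and prev is not None:
            i += 1
            while i < n and body[i].isspace():
                i += 1
            if i >= n:
                chars.add("-")  # trailing hyphen is taken as a literal
                break
            end, i = read_char(i)
            if ord(end) < ord(prev):
                raise ValueError("range end precedes range start")
            chars.update(chr(cp) for cp in range(ord(prev), ord(end) + 1))
            prev = None
        else:
            prev, i = read_char(i)
            chars.add(prev)
    return sorted(chars) + sorted(strings)


if __name__ == "__main__":
    print(parse_simple_unicodeset(r"[a-c {ch}]"))  # ['a', 'b', 'c', 'ch']
```

Returning a sorted, de-duplicated list mirrors the fact that the notation describes a set, so the order in which elements appear in the pattern carries no meaning.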

