On Fri, 26 Nov 2010 07:27:02 -0500, Simon Slavin <slav...@bigfraud.org> wrote:
> On 26 Nov 2010, at 6:52am, Niklas Bäckman wrote:
>
>> You are right of course. The shell should not count code points, but
>> graphemes.
>>
>> http://unicode.org/faq/char_combmark.html#7
>> [snip]
>> Or would it be possible to write such a graphemelen(s) function in not
>> too many lines of C code without needing any external Unicode libraries?
>
> No. Sorry, but Unicode was not designed to make it simple to figure out
> such a function. You need lots of data to figure out how the compound
> characters work.

“Lots of data” can still be represented efficiently:

http://www.strchr.com/multi-stage_tables

(I am not affiliated with that site in any way.) Such coding tricks seem to be used more often for case-folding tables, script identification, and so forth; but I don’t see why the same principles couldn’t be used for all Unicode properties, including the combining-character properties.

You don’t need ICU or a similar monstrosity to get at Unicode properties. Big, heavy libraries will help you support CLDR, different collations for every language, calendrical calculations and conversions, and so on. Excluding Unihan, basic Unicode-property lookups should compile down to something much lighter in weight than SQLite itself.

N.b., there is a severe bug in a popular Unicode-properties SQLite extension: pointers are calculated from values truncated to 16 bits, which goes wrong for code points above plane 0. The extension only attempts to cover a few high-plane characters (if memory serves, three of them in array 198); but with the high bits snipped off, I rather doubt those will be what is actually affected. I attempted to contact the author about the bug last year when I discovered it, but was unable to find a private contact method on a brief glance through the author’s site. Perhaps the bug has been fixed by now; I never checked back. Anyone who intelligently investigates compiler warnings would not be bitten anyway. I write off the whole episode as a victory for spammers.
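To make the multi-stage table idea above concrete, here is a minimal sketch in C. It is not taken from the linked article or from any existing extension. The names (sample_marks, build_tables, is_combining) are hypothetical, and the three ranges are a deliberately incomplete sample of combining marks, not the real Unicode data. A real build would more likely generate the tables offline from UnicodeData.txt and emit them as const arrays. The lookup takes a full 32-bit code point, so values above plane 0 are not truncated.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BITS 8u
#define BLOCK_SIZE (1u << BLOCK_BITS)            /* 256 entries per block */
#define MAX_CP     0x10FFFFu                     /* last Unicode code point */
#define N_STAGE1   ((MAX_CP >> BLOCK_BITS) + 1u) /* 0x1100 stage-1 entries */
#define MAX_BLOCKS 16u                           /* ample for the sample data */

struct range { uint32_t lo, hi; };

/* ASSUMPTION: a small, incomplete sample of combining-mark ranges. */
static const struct range sample_marks[] = {
    { 0x0300,  0x036F  },  /* Combining Diacritical Marks */
    { 0x20D0,  0x20F0  },  /* combining marks for symbols */
    { 0xE0100, 0xE01EF },  /* Variation Selectors Supplement (plane 14) */
};

static uint8_t  stage1[N_STAGE1];                /* block number per 256 cps */
static uint8_t  stage2[MAX_BLOCKS][BLOCK_SIZE];  /* distinct blocks only */
static unsigned n_blocks;

/* Slow reference predicate, used only while building the tables. */
static int in_sample(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof sample_marks / sizeof sample_marks[0]; i++)
        if (cp >= sample_marks[i].lo && cp <= sample_marks[i].hi)
            return 1;
    return 0;
}

/* Fill stage1/stage2, storing each distinct 256-entry block only once. */
void build_tables(void)
{
    uint8_t block[BLOCK_SIZE];
    n_blocks = 0;
    for (uint32_t b = 0; b < N_STAGE1; b++) {
        for (uint32_t i = 0; i < BLOCK_SIZE; i++)
            block[i] = (uint8_t)in_sample((b << BLOCK_BITS) | i);

        unsigned k;
        for (k = 0; k < n_blocks; k++)           /* reuse an identical block */
            if (memcmp(stage2[k], block, BLOCK_SIZE) == 0)
                break;
        if (k == n_blocks) {                     /* new distinct block */
            assert(n_blocks < MAX_BLOCKS);
            memcpy(stage2[n_blocks++], block, BLOCK_SIZE);
        }
        stage1[b] = (uint8_t)k;
    }
}

/* Two array reads per lookup. cp is a full 32-bit value, never 16-bit. */
int is_combining(uint32_t cp)
{
    if (cp > MAX_CP)
        return 0;
    return stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1u)];
}
```

After a single call to build_tables(), is_combining(0x0301) and is_combining(0xE0100) both return 1, while is_combining((uint16_t)0xE0100) looks up U+0100 instead and returns 0, which is the kind of damage a 16-bit truncation does. With this sample data only four distinct blocks are stored, so the tables occupy about 5 KB rather than the 1 MB or so that a flat byte-per-code-point array would need.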
Very truly,

Samuel Adam <a...@certifound.com>
763 Montgomery Road
Hillsborough, NJ 08844-1304
United States
http://certifound.com/