On Fri, 26 Nov 2010 07:27:02 -0500, Simon Slavin <slav...@bigfraud.org>  
wrote:

> On 26 Nov 2010, at 6:52am, Niklas Bäckman wrote:
>
>> You are right of course. The shell should not count code points, but
>> graphemes.
>>
>> http://unicode.org/faq/char_combmark.html#7
>>
[snip]
>> Or would it be possible to write such a graphemelen(s) function in not
>> too many lines of C code without needing any external Unicode libraries?
>
> No.  Sorry, but Unicode was not designed to make it simple to figure out  
> such a function.  You need lots of data to figure out how the compound  
> characters work.

“Lots of data” can still be represented efficiently:

http://www.strchr.com/multi-stage_tables
(I am not affiliated with that site in any way.)

Such coding tricks are more commonly used for case-folding tables, script
identification, and so forth; but I don’t see why the same principles
couldn’t be applied to all Unicode properties, including the
combining-character data.
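
To make the idea concrete, here is a minimal sketch of a two-stage lookup
in C. Everything in it is an assumption for illustration: the names
(toy_tables_init, toy_is_combining, toy_graphemelen) are hypothetical, the
data covers only the Combining Diacritical Marks block (U+0300..U+036F),
and the tables are filled at run time to keep the listing short. A real
table would be generated offline from the Unicode Character Database and
compiled in as const arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SHIFT 8                       /* 256 code points per block */
#define BLOCK_SIZE  (1u << BLOCK_SHIFT)
#define MAX_CP      0x10FFFFu
#define N_BLOCKS    ((MAX_CP >> BLOCK_SHIFT) + 1)   /* 4352 stage-1 entries */

/* Stage 1 maps a block number to a row of stage 2.  Blocks with identical
   contents share one row, which is where the space saving comes from. */
static uint8_t stage1[N_BLOCKS];        /* zero-initialized: every block -> row 0 */
static uint8_t stage2[2][BLOCK_SIZE];   /* row 0: nothing set; row 1: U+0300..U+03FF */

/* TOY DATA ONLY: marks U+0300..U+036F as combining and nothing else. */
void toy_tables_init(void)
{
    uint32_t cp;

    /* The toy range must sit inside a single block for this setup. */
    assert((0x036Fu >> BLOCK_SHIFT) == (0x0300u >> BLOCK_SHIFT));

    stage1[0x0300u >> BLOCK_SHIFT] = 1;
    for (cp = 0x0300u; cp <= 0x036Fu; cp++)
        stage2[1][cp & (BLOCK_SIZE - 1)] = 1;
}

/* Two array reads per lookup; the full 21-bit code point is used throughout. */
int toy_is_combining(uint32_t cp)
{
    if (cp > MAX_CP)
        return 0;
    return stage2[stage1[cp >> BLOCK_SHIFT]][cp & (BLOCK_SIZE - 1)];
}

/* Rough approximation of a grapheme count over already-decoded code points:
   every code point that is not a combining mark starts a new grapheme.
   This is NOT the full Unicode segmentation algorithm. */
size_t toy_graphemelen(const uint32_t *cps, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (!toy_is_combining(cps[i]))
            count++;
    return count;
}
```

With these toy tables the whole structure is 4352 + 512 bytes, against
0x110000 bytes for a flat one-byte-per-code-point array. Real property
data needs more rows in the second stage, but the lookup stays the same
two array reads.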

You don’t need ICU or a similar monstrosity to get at Unicode properties.
Big, heavy libraries will help you support CLDR, different collations for
every language, calendrical calculations and conversions, and so on.
Excluding Unihan, basic Unicode-property lookups should compile down to
something much lighter than SQLite itself.

N.b., there is a severe bug in a popular Unicode-properties SQLite
extension: pointers are calculated from values truncated to 16 bits for
code points above plane 0.  The extension only attempts to cover a few
high-plane characters (if memory serves, three of them in array 198); but
with the high bits snipped off, I rather doubt those will be what is
actually affected.  I tried to contact the author about the bug last year
when I discovered it, but could not find a private contact method on a
brief glance through the author’s site.  Perhaps the bug has been fixed by
now; I never checked back.  Anyone who intelligently investigates compiler
warnings would not be bitten anyway.  I write off the whole episode as a
victory for spammers.
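
For anyone unsure what this class of bug looks like, here is a hedged
illustration in C. It is not the extension’s code, which I am not
reproducing here; the function names (truncate_to_16, bmp_index_checked)
are hypothetical and show only the general failure of narrowing a code
point to 16 bits before using it as a table index.

```c
#include <stdint.h>

/* What a careless narrowing conversion does: only the low 16 bits survive.
   U+1D165 (a combining mark in plane 1) becomes 0xD165, which lies in the
   Hangul Syllables block of plane 0, so a lookup hits the wrong entry. */
uint16_t truncate_to_16(uint32_t cp)
{
    return (uint16_t)cp;
}

/* Safer for a table that only covers plane 0: refuse anything above
   U+FFFF instead of letting it alias onto an unrelated character.
   Returns 1 and stores the index on success, 0 otherwise. */
int bmp_index_checked(uint32_t cp, uint16_t *index_out)
{
    if (cp > 0xFFFFu)
        return 0;
    *index_out = (uint16_t)cp;
    return 1;
}
```

Most compilers can warn about the implicit form of that conversion (for
example, GCC and Clang with -Wconversion), which is the sort of warning
worth investigating rather than silencing.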

Very truly,

Samuel Adam <a...@certifound.com>
763 Montgomery Road
Hillsborough, NJ  08844-1304
United States
http://certifound.com/
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
