Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly

2010-11-29 Thread Nicolas Williams
On Fri, Nov 26, 2010 at 06:52:56AM +, Niklas Bäckman wrote:
> Igor Tandetnik  writes:
> > Note that counting codepoints, while it happens to help with your
> > particular data, won't help in general.  Consider combining
> > diacritics: U+00E4 (small A with diaeresis) looks the same as U+0061
> > U+0308 (small letter A + combining diaeresis) when printed on the
> > console.
> 
> You are right of course. The shell should not count code points, but 
> graphemes.

And their widths.
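
To illustrate the point about widths, here is a minimal Python sketch of a display-width estimate. It is an approximation built on assumptions, not what the SQLite shell does: the names display_width and pad_to_column are made up for this example, marks and format characters (categories Mn, Me, Cf) are assumed to take no columns, and East Asian wide or fullwidth characters are assumed to take two. Real terminals can disagree, for instance on emoji sequences or ambiguous-width characters.

```python
import unicodedata

# Assumption: characters in these general categories occupy no column.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}


def display_width(s: str) -> int:
    """Estimate how many terminal columns s occupies (rough heuristic)."""
    width = 0
    for ch in s:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue  # combining marks and format characters add no width
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


def pad_to_column(s: str, column_width: int) -> str:
    """Right-pad s with spaces based on estimated width, not byte count."""
    return s + " " * max(0, column_width - display_width(s))
```

Padding by display_width instead of by byte length is what keeps a column aligned when a value contains non-ASCII text.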

Nico
-- 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly

2010-11-26 Thread Jean-Christophe Deschamps
At 14:26 26/11/2010, you wrote:

>N.b., there is a severe bug (pointers calculated based on truncated 
>16-bit
>values above plane-0) in a popular Unicode-properties SQLite extension.
>The extension only attempts covering a few high-plane characters—if 
>memory
>serves, three of them in array 198; but with the high-bits snipped 
>off, I
>rather doubt those will be what is actually affected.  I attempted
>contacting the author about the bug last year when I discovered it, but
>was unable to find a private contact method on a brief glance through 
>the
>author’s site.  Perhaps the bug has been fixed by now; I never checked
>back; anyone who intelligently investigates compiler warnings would 
>not be
>bitten anyway.  I write off the whole episode as a victory for spammers.

I believe you refer to Ioannis' code.  I found this 16-bit truncation 
and decided to expand that trie to 32-bit in order to support those 
characters correctly.  As I had several distinct needs (still 
highly related to Unicode) I decided to rewrite most of the code and 
expand it in a number of directions.  Anyone interested can contact me 
so I can post the source.




Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly

2010-11-26 Thread Samuel Adam
On Fri, 26 Nov 2010 07:27:02 -0500, Simon Slavin wrote:

> On 26 Nov 2010, at 6:52am, Niklas Bäckman wrote:
>
>> You are right of course. The shell should not count code points, but  
>> graphemes.
>>
>> http://unicode.org/faq/char_combmark.html#7
>>
[snip]
>> Or would it be possible to write such a graphemelen(s) function in not  
>> too many
>> lines of C code without needing any external Unicode libraries?
>
> No.  Sorry, but Unicode was not designed to make it simple to figure out  
> such a function.  You need lots of data to figure out how the compound  
> characters work.

“Lots of data” can still be represented efficiently:

http://www.strchr.com/multi-stage_tables
(I am not affiliated with that site in any way.)

Such coding tricks seem more usually used for case-folding tables, script  
identification, and so forth; but I don’t see why the same principles  
couldn’t be used for all Unicode properties, including the combiner stuff.
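
As a rough illustration of the multi-stage idea, the Python sketch below builds a two-stage table for a single boolean property ("is a nonspacing mark", general category Mn). The 256-entry block size, the chosen property and all function names are assumptions made for this example; a C implementation would normally generate the two arrays offline and compile them in.

```python
import unicodedata

BLOCK_BITS = 8                 # assumption: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CODE_POINT = 0x10FFFF


def is_nonspacing_mark(cp: int) -> bool:
    """Reference property: general category Mn."""
    return unicodedata.category(chr(cp)) == "Mn"


def build_two_stage_table(prop):
    """Return (stage1, stage2): stage1 maps a block number to an index
    into stage2, which stores each distinct block only once."""
    stage1, stage2, seen = [], [], {}
    for start in range(0, MAX_CODE_POINT + 1, BLOCK_SIZE):
        block = bytes(prop(cp) for cp in range(start, start + BLOCK_SIZE))
        index = seen.get(block)
        if index is None:          # first time this block pattern appears
            index = len(stage2)
            seen[block] = index
            stage2.append(block)
        stage1.append(index)
    return stage1, stage2


def lookup(stage1, stage2, cp: int) -> bool:
    """Two array reads; cp keeps its full 21-bit range."""
    return bool(stage2[stage1[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)])


stage1, stage2 = build_two_stage_table(is_nonspacing_mark)
```

Most blocks contain no marks at all and collapse into one shared block, which is where the space saving comes from.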

You don’t need ICU or a similar monstrosity to get at Unicode properties.   
Big, heavy libraries will help you support CLDR, different collations for  
every language, calendrical calculations and conversions, and so on, and  
so forth.  Excluding Unihan, basic Unicode-property lookups should compile  
down much lighter in weight than SQLite itself.

N.b., there is a severe bug (pointers calculated based on truncated 16-bit  
values above plane-0) in a popular Unicode-properties SQLite extension.   
The extension only attempts covering a few high-plane characters—if memory  
serves, three of them in array 198; but with the high-bits snipped off, I  
rather doubt those will be what is actually affected.  I attempted  
contacting the author about the bug last year when I discovered it, but  
was unable to find a private contact method on a brief glance through the  
author’s site.  Perhaps the bug has been fixed by now; I never checked  
back; anyone who intelligently investigates compiler warnings would not be  
bitten anyway.  I write off the whole episode as a victory for spammers.
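
For readers unfamiliar with this class of bug, a small Python sketch of what 16-bit truncation does to a code point above plane 0. This is not the extension's code, which is not shown in this thread; the function name and the sample code point are chosen purely for illustration.

```python
def truncate_to_16_bits(code_point: int) -> int:
    """Keep only the low 16 bits, as storing into a 16-bit unsigned would."""
    return code_point & 0xFFFF


high_plane = 0x1D167   # an arbitrary plane-1 code point, for illustration only
aliased = truncate_to_16_bits(high_plane)

# The truncated value lands on an unrelated plane-0 code point.
print(f"U+{high_plane:04X} -> U+{aliased:04X}")   # prints "U+1D167 -> U+D167"
```

Any table index computed from the truncated value would then address the entry for the plane-0 code point rather than the intended one.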

Very truly,

Samuel Adam 
763 Montgomery Road
Hillsborough, NJ  08844-1304
United States
http://certifound.com/


Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly

2010-11-26 Thread Simon Slavin

On 26 Nov 2010, at 6:52am, Niklas Bäckman wrote:

> You are right of course. The shell should not count code points, but 
> graphemes.
> 
> http://unicode.org/faq/char_combmark.html#7
> 
> I guess that this probably falls out of the "lite" scope of SQLITE though?

There is absolutely no way you're going to get graphemes into the SQLite 
library until the SQLite library is written to support Unicode in other ways 
(which it currently doesn't).

The command-line tool could possibly have grapheme-counting added to it, 
though.  The 'lite' in 'SQLite' only has to refer to the routines people need 
to compile into their applications; there's no need to keep an external tool 
slim.

> Or would it be possible to write such a graphemelen(s) function in not too 
> many
> lines of C code without needing any external Unicode libraries?

No.  Sorry, but Unicode was not designed to make it simple to figure out such a 
function.  You need lots of data to figure out how the compound characters work.
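
A short Python sketch shows both how far a naive approach gets and why it falls short. The hypothetical approx_grapheme_len below only attaches nonspacing and enclosing marks (categories Mn and Me) to the preceding character. It does not implement the Unicode grapheme cluster rules: Hangul jamo sequences, zero-width-joiner emoji sequences, regional indicator pairs and spacing marks are all miscounted, and even this much depends on the property data shipped in Python's unicodedata module.

```python
import unicodedata

# Assumption: only these categories are treated as extending a cluster.
EXTENDING_CATEGORIES = ("Mn", "Me")


def approx_grapheme_len(s: str) -> int:
    """Count base characters, folding following marks into them.

    A mark with no preceding base (at the start of the string)
    is counted as a cluster of its own.
    """
    count = 0
    for ch in s:
        if count > 0 and unicodedata.category(ch) in EXTENDING_CATEGORIES:
            continue  # the mark belongs to the previous cluster
        count += 1
    return count
```

With this, "a" followed by U+0308 counts as one unit, the same as the precomposed U+00E4.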

Simon.


Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly

2010-11-25 Thread Igor Tandetnik
Niklas Bäckman  wrote:
> Columns with special characters like ("å" "ä" "å") get too short widths when
> output.
> 
> I guess this is due to the shell not counting actual UTF8 *characters/code
> points* when calculating the widths, but instead only
> counting the plain bytes in the strings, so they will seem longer until they
> are actually printed to the console.

Note that counting codepoints, while it happens to help with your particular 
data, won't help in general. Consider combining diacritics: U+00E4 (small A 
with diaeresis) looks the same as U+0061 U+0308 (small letter A + combining 
diaeresis) when printed on the console.
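
The difference between the three counts can be checked with a few lines of Python (a sketch for illustration only; the helper names are made up here and the shell itself is written in C):

```python
import unicodedata

precomposed = "\u00e4"   # U+00E4, small a with diaeresis
decomposed = "a\u0308"   # U+0061 followed by U+0308, combining diaeresis


def utf8_byte_count(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def code_point_count(s: str) -> int:
    """Number of Unicode code points in s (len() of a Python 3 str)."""
    return len(s)


for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, utf8_byte_count(text), code_point_count(text))

# The two spellings are canonically equivalent: NFC composes the pair.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The precomposed form is 2 bytes and 1 code point; the decomposed form is 3 bytes and 2 code points, yet both occupy one column on screen.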
-- 
Igor Tandetnik
