Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly
On Fri, Nov 26, 2010 at 06:52:56AM +, Niklas Bäckman wrote:
> Igor Tandetnik writes:
> > Note that counting codepoints, while it happens to help with your
> > particular data, won't help in general. Consider combining
> > diacritics: U+00E4 (small A with diaeresis) looks the same as U+0061
> > U+0308 (small letter A + combining diaeresis) when printed on the
> > console.
>
> You are right of course. The shell should not count code points, but
> graphemes.

And their widths.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
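To illustrate the point about widths, here is a minimal sketch in Python (chosen only for brevity; the shell itself is C). The helper names display_width and pad_to_width are hypothetical, and the rule used (combining marks take zero columns, East Asian wide and fullwidth characters take two, everything else one) is only a rough approximation of what a terminal does, not a full implementation of the Unicode width or grapheme rules.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate how many terminal columns `text` occupies (approximation)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


def pad_to_width(text: str, width: int) -> str:
    """Right-pad `text` with spaces until it fills `width` columns."""
    return text + " " * max(0, width - display_width(text))


print(display_width("a\u0308"))        # letter a followed by combining diaeresis
print(display_width("\u65e5\u672c"))   # two wide CJK characters
```

Control characters, zero-width joiners and terminal-specific behaviour are deliberately ignored here, so this should be read as a starting point rather than a correct width function.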
Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly
At 14:26 26/11/2010, you wrote:
>N.b., there is a severe bug (pointers calculated based on truncated
>16-bit values above plane-0) in a popular Unicode-properties SQLite
>extension. The extension only attempts covering a few high-plane
>characters -- if memory serves, three of them in array 198; but with
>the high-bits snipped off, I rather doubt those will be what is
>actually affected. I attempted contacting the author about the bug
>last year when I discovered it, but was unable to find a private
>contact method on a brief glance through the author's site. Perhaps
>the bug has been fixed by now; I never checked back; anyone who
>intelligently investigates compiler warnings would not be bitten
>anyway. I write off the whole episode as a victory for spammers.

I believe you refer to Ioannis's code. I found this 16-bit truncation and decided to expand that trie to 32-bit in order to support those characters correctly. Because I had several distinct needs (still highly related to Unicode), I decided to rewrite most of the code and expand it in a number of directions. Anyone interested can contact me so that I can post the source.
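For readers unfamiliar with this class of bug, the following sketch shows what happens in general when a code point above plane 0 is squeezed into 16 bits before a property lookup. It is not the extension's code: the function name truncate_to_16_bits and the sample code point U+1D165 are assumptions chosen purely for illustration, and Python's unicodedata module stands in for the extension's own tables.

```python
import unicodedata


def truncate_to_16_bits(code_point: int) -> int:
    """Mimic storing a code point in a 16-bit integer: the high bits are lost."""
    return code_point & 0xFFFF


original = 0x1D165                      # a plane-1 combining mark (musical symbol)
truncated = truncate_to_16_bits(original)

# The truncated value is a valid but unrelated plane-0 code point,
# so any property looked up for it describes the wrong character.
print(hex(truncated))
print(unicodedata.name(chr(truncated)))
print(unicodedata.combining(chr(original)), unicodedata.combining(chr(truncated)))
```

The lookup does not fail loudly; it silently returns the properties of a different character, which is why this kind of truncation is easy to miss without paying attention to compiler warnings.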
Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly
On Fri, 26 Nov 2010 07:27:02 -0500, Simon Slavin wrote:
> On 26 Nov 2010, at 6:52am, Niklas Bäckman wrote:
>
>> You are right of course. The shell should not count code points, but
>> graphemes.
>>
>> http://unicode.org/faq/char_combmark.html#7
>> [snip]
>> Or would it be possible to write such a graphemelen(s) function in not
>> too many lines of C code without needing any external Unicode libraries?
>
> No. Sorry, but Unicode was not designed to make it simple to figure out
> such a function. You need lots of data to figure out how the compound
> characters work.

“Lots of data” can still be represented efficiently:

http://www.strchr.com/multi-stage_tables

(I am not affiliated with that site in any way.)

Such coding tricks seem more usually used for case-folding tables, script identification, and so forth; but I don’t see why the same principles couldn’t be used for all Unicode properties, including the combiner stuff.

You don’t need ICU or a similar monstrosity to get at Unicode properties. Big, heavy libraries will help you support CLDR, different collations for every language, calendrical calculations and conversions, and so on, and so forth. Excluding Unihan, basic Unicode-property lookups should compile down much lighter in weight than SQLite itself.

N.b., there is a severe bug (pointers calculated based on truncated 16-bit values above plane-0) in a popular Unicode-properties SQLite extension. The extension only attempts covering a few high-plane characters -- if memory serves, three of them in array 198; but with the high-bits snipped off, I rather doubt those will be what is actually affected. I attempted contacting the author about the bug last year when I discovered it, but was unable to find a private contact method on a brief glance through the author’s site. Perhaps the bug has been fixed by now; I never checked back; anyone who intelligently investigates compiler warnings would not be bitten anyway. I write off the whole episode as a victory for spammers.

Very truly,

Samuel Adam
763 Montgomery Road
Hillsborough, NJ 08844-1304
United States
http://certifound.com/
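The multi-stage table idea mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the linked article or from any SQLite extension: the names build_two_stage_table and is_combining, the 256-entry block size, and the choice of the "is a combining mark" property are all made up for the example, and the table is built at run time from Python's unicodedata module, whereas a C implementation would normally ship pre-generated arrays.

```python
import unicodedata

BLOCK_BITS = 8                      # assumed block size: 256 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CODE_POINT = 0x10FFFF


def build_two_stage_table():
    """Return (stage1, blocks): stage1 maps a block number to a shared block."""
    stage1 = []
    blocks = []
    seen = {}
    for start in range(0, MAX_CODE_POINT + 1, BLOCK_SIZE):
        block = bytes(
            1 if unicodedata.combining(chr(cp)) else 0
            for cp in range(start, start + BLOCK_SIZE)
        )
        index = seen.get(block)
        if index is None:           # first time this exact block is seen: store it
            index = len(blocks)
            seen[block] = index
            blocks.append(block)
        stage1.append(index)        # identical blocks share a single stored copy
    return stage1, blocks


STAGE1, BLOCKS = build_two_stage_table()


def is_combining(code_point: int) -> bool:
    """Two array lookups: high bits pick the block, low bits pick the entry."""
    block = BLOCKS[STAGE1[code_point >> BLOCK_BITS]]
    return block[code_point & (BLOCK_SIZE - 1)] == 1


print(len(STAGE1), len(BLOCKS))     # number of block slots vs. distinct blocks stored
print(is_combining(0x0308), is_combining(0x0061))
```

Because most 256-entry blocks are identical (typically all zero for a sparse property), only a small number of distinct blocks need to be stored, which is where the space saving comes from. Note that the index into stage1 must be computed from the full code point; truncating it to 16 bits first would reintroduce the bug described above.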
Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly
On 26 Nov 2010, at 6:52am, Niklas Bäckman wrote:

> You are right of course. The shell should not count code points, but
> graphemes.
>
> http://unicode.org/faq/char_combmark.html#7
>
> I guess that this probably falls out of the "lite" scope of SQLITE though?

There is absolutely no way you're going to get graphemes into the SQLite library until the SQLite library is written to support Unicode in other ways (which it currently doesn't). The command-line tool could possibly have grapheme-counting added to it, though. The 'lite' in 'SQLite' only has to refer to the routines people need to compile into their applications; there's no need to keep an external tool slim.

> Or would it be possible to write such a graphemelen(s) function in not too
> many lines of C code without needing any external Unicode libraries?

No. Sorry, but Unicode was not designed to make it simple to figure out such a function. You need lots of data to figure out how the compound characters work.

Simon.
Re: [sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly
Niklas Bäckman wrote:
> Columns with special characters like ("å" "ä" "å") get too short widths when
> output.
>
> I guess this is due to the shell not counting actual UTF8 *characters/code
> points* when calculating the widths, but instead only counting the plain
> bytes in the strings, so they will seem longer until they are actually
> printed to the console.

Note that counting codepoints, while it happens to help with your particular data, won't help in general. Consider combining diacritics: U+00E4 (small A with diaeresis) looks the same as U+0061 U+0308 (small letter A + combining diaeresis) when printed on the console.

--
Igor Tandetnik
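The two spellings mentioned above can be compared directly. The sketch below uses Python's standard unicodedata module; the helper name count_base_characters is hypothetical, and skipping code points with a non-zero combining class is only a crude stand-in for real grapheme segmentation.

```python
import unicodedata

precomposed = "\u00e4"     # U+00E4, small a with diaeresis
decomposed = "a\u0308"     # U+0061 followed by U+0308, combining diaeresis


def count_base_characters(text: str) -> int:
    """Count code points that are not combining marks (rough approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Same appearance on screen, different code point counts.
print(len(precomposed), len(decomposed))

# Skipping combining marks gives the same count for both spellings.
print(count_base_characters(precomposed), count_base_characters(decomposed))

# NFC normalization composes the pair back into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Normalizing to NFC before counting helps only where a precomposed form exists; sequences with no precomposed equivalent still need the combining marks to be skipped or handled by proper grapheme rules.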
[sqlite] SQLITE 3.7.3 bug report (shell) - output in column mode does not align UTF8-strings correctly
There's a problem aligning columns containing UTF-8 (non-ASCII) characters correctly when output is done in column mode.

The .width was set as ".width 5 30 15 15 10 30 10" and here's an example of the incorrect output:

(Maybe paste into editor with fixed width font to better see the error)

===
BookI Title Last NameFirst Name Date Read Original Title Story Titl
- -- --- --- -- -- --
20 Bakom fasaden Sandemo Margit 0
20 Bakom fasaden Sandemo Margit 2010-07-02
13 Blodshämnd Sandemo Margit 0
13 Blodshämnd Sandemo Margit 2010-06-19
10 Bödelns dotter Sandemo Margit 0
10 Bödelns dotter Sandemo Margit 2010-06-11
24 Demonen och jungfrunSandemo Margit 0
24 Demonen och jungfrunSandemo Margit 2010-07-08
43 Demonernas fjäll Sandemo Margit 0
43 Demonernas fjäll Sandemo Margit 2010-07-25
===

Columns with special characters like ("å" "ä" "å") get too short widths when output.

I guess this is due to the shell not counting actual UTF8 *characters/code points* when calculating the widths, but instead only counting the plain bytes in the strings, so they will seem longer until they are actually printed to the console.

The output was done to a Windows console with code page set to 65001.

--
Niklas
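The suspected cause can be reproduced outside the shell. The sketch below is not the shell's code (which is C); it is a small Python illustration with hypothetical helper names pad_by_bytes and pad_by_code_points, showing how padding computed from the UTF-8 byte count comes out one column short for each two-byte character.

```python
def pad_by_bytes(text: str, width: int) -> str:
    """Pad as if every UTF-8 byte occupied one screen column (the suspected bug)."""
    byte_length = len(text.encode("utf-8"))
    return text + " " * max(0, width - byte_length)


def pad_by_code_points(text: str, width: int) -> str:
    """Pad by counting code points instead of bytes."""
    return text + " " * max(0, width - len(text))


title = "Blodsh\u00e4mnd"                 # the letter a with diaeresis is two bytes in UTF-8
byte_count = len(title.encode("utf-8"))
code_point_count = len(title)

print(byte_count, code_point_count)
print("[" + pad_by_bytes(title, 15) + "]")         # one column too narrow
print("[" + pad_by_code_points(title, 15) + "]")   # fills all 15 columns
```

Counting code points is enough for precomposed Latin text like the titles in this report, but as the replies in this thread point out, it is still not correct for combining sequences or for characters that occupy two columns.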