Re: [sqlite] Unicode support in SQLite
On 14/10/14 17:02, Kevin Benson wrote:
> https://bitbucket.org/alekseyt/nunicode/downloads/libnusqlite3-1.4-4a0e4773-win32.zip
> <--- 404 response code

Thank you, fixed now.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unicode support in SQLite
On Tue, Oct 14, 2014 at 4:37 AM, Aleksey Tulinov wrote:
> Hello,
>
> I'm glad to announce that nunicode SQLite extension was updated to support
> Unicode-conformant case folding and was improved on performance of every
> component provided to SQLite.
>
> You can read about and download this extension at BitBucket page of
> nunicode library:
> https://bitbucket.org/alekseyt/nunicode#markdown-header-sqlite3-extension
>
> This extension provides the following Unicode-aware components:
>
> - upper(X)
> - lower(X)
> - X LIKE Y ESCAPE Z
> - COLLATE NU700 : case-sensitive Unicode 7.0.0 collation
> - COLLATE NU700_NOCASE : case-insensitive Unicode 7.0.0 collation

https://bitbucket.org/alekseyt/nunicode/downloads/libnusqlite3-1.4-4a0e4773-win32.zip <--- 404 response code

-- -- -- --Ô¿Ô-- K e V i N
Re: [sqlite] Unicode support in SQLite
Hello,

I'm glad to announce that the nunicode SQLite extension has been updated to support the Unicode 7.0.0 character set. It also implements a LIKE operation that is faster than in previous releases.

This extension provides the following Unicode-aware components:

- upper(X)
- lower(X)
- X LIKE Y ESCAPE Z
- COLLATE NU700 : case-sensitive Unicode 7.0.0 collation
- COLLATE NU700_NOCASE : case-insensitive Unicode 7.0.0 collation

The collation functions implement the default Unicode collation (based on DUCET). The previously implemented Unicode 6.3.0 collations NU630 and NU630_NOCASE were removed from this version of the extension.

You can find implementation details, the changelog and downloads on the BitBucket page of the nunicode library: https://bitbucket.org/alekseyt/nunicode
Re: [sqlite] Unicode support in SQLite
Hey,

Following the previous discussion on this mailing list, I've updated the nunicode SQLite extension so that it no longer overrides the default NOCASE collation, because of possible issues with database indexing. Version 1.2.1 removes the nunicode-specific NOCASE and NUNICODE collations and introduces the NU630 and NU630_NOCASE collations instead. The first is a case-sensitive Unicode 6.3.0 collation, the second is case-insensitive; both implement the default Unicode collation ordering (DUCET).

In all other regards it does not differ from version 1.2 of the extension and is based on the same nunicode 1.2.

Full changelog is available here: https://bitbucket.org/alekseyt/nunicode/src/master/CHANGELOG

Pre-compiled extensions are available under "Downloads" for Win32 and i386/amd64 Linux.
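For context on why a separate Unicode-aware collation is useful at all: the built-in NOCASE collation folds only ASCII letters. The following is a minimal sketch using Python's standard sqlite3 module (not nunicode itself); the helper name equal_nocase is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def equal_nocase(a: str, b: str) -> bool:
    """Compare two strings with SQLite's built-in NOCASE collation."""
    (result,) = conn.execute("SELECT ? = ? COLLATE NOCASE", (a, b)).fetchone()
    return bool(result)

ascii_match = equal_nocase("ABC", "abc")          # ASCII letters are folded
umlaut_match = equal_nocase("\u00c4", "\u00e4")   # Ä vs ä: not folded by NOCASE
print(ascii_match, umlaut_match)                  # prints: True False
```

An index built with NOCASE depends on exactly this behavior, which is presumably why replacing NOCASE under an existing database is risky, while adding a collation under a new name is not.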
Re: [sqlite] Unicode support in SQLite
Very nice! Thanks for sharing, Aleksey.

2013/11/9 Aleksey Tulinov
> On 11/04/2013 11:50 AM, Aleksey Tulinov wrote:
>> Hey,
>>
>> As you can see, this is truly full Unicode collation and case mapping
>> with untailored special casing. Extension provides the following functions,
>> statements and collations:
>
> I've updated extension, examples and documentation, now it's easier to
> link extension statically. Everything, including new prebuilt binaries, is
> available on BitBucket, changelog is available here:
> https://bitbucket.org/alekseyt/nunicode/src/master/CHANGELOG
Re: [sqlite] Unicode support in SQLite
On 11/04/2013 11:50 AM, Aleksey Tulinov wrote:
> Hey,
>
> As you can see, this is truly full Unicode collation and case mapping
> with untailored special casing. Extension provides the following functions,
> statements and collations:

I've updated the extension, examples and documentation; it is now easier to link the extension statically. Everything, including new prebuilt binaries, is available on BitBucket. The changelog is available here: https://bitbucket.org/alekseyt/nunicode/src/master/CHANGELOG
Re: [sqlite] Unicode support
On Tue, Nov 17, 2009 at 09:31:46PM -0500, Tim Romano wrote:
> but if ORDER BY is
> relying on an index for ordering, then flip() can have negative
> effects.
>
> Substr() could have negative effects on ordering too. That is a red
> herring. Flip() is merely a function that reverses the order of
> codepoints "as found" without knowing anything about what those
> codepoints, individually or in combination, might signify in a writing
> system. If I want to write those codepoints to a column that's my concern.

In Unicode there are codepoints, characters, and glyphs. Codepoints are single 21-bit values. Characters are either single codepoints or combinations of codepoints. Glyphs are either single characters or combinations of characters that are displayed as single programmatically-constructed glyphs.

SQLite3 knows about none of that. Nor about normalization forms.

Therefore any functions like substr() and flip() that work at the codepoint level (or worse, at the byte level, but fortunately substr() is UTF-8/16 aware) can break semantics for your strings.

> What if I wanted to have a column that consisted of codepoints from all
> over the Unicode range: a codepoint from Greek next to a codepoint from
> Swahili next to a codepoint from Hungarian? Shouldn't I be able to say
> to a database: this column contains codepoints (characters) and
> collation is not relevant, sort the column using the numeric value of
> the codepoints?

Yes, I think so. I'm not sure why you'd want that, but yes, it ought to be possible, and right now SQLite3 lets you do that because it is not aware of characters and glyphs -- SQLite3 is aware of only codepoints. But if you load the ICU extensions that might change!
Ideally there should be a way to indicate a variety of Unicode-related behaviors:

- normalization form for use in index keys
- normalization-insensitive string comparison operators
- whether to normalize values in tables and, if so, with what form (by column, obviously)
   - if you normalize strings in index keys but not in tables then you get normalization-insensitive-but-normalization-preserving behavior, which is really, really convenient
- collation options, such as language
- whether to honor language tags embedded in the UTF-8/16 strings
- multiple text types? (string of codepoints, of characters, or glyphs)
- a whole range of Unicode-aware functions like substr() (and flip(), and like(), and regex(), and glob(), ...), with options for character and glyph counting instead of codepoint counting
- codesets (for non-Unicode data), with automatic codeset conversions similar to type conversions
   - to have automatic conversions I think would require an extensible text type system

That's... a lot of functionality. I'm not sure how much of it needs to be implemented with help from the SQLite3 core, versus extensions. It'd be nice if all of it could be implemented via extensions, but I don't think that's possible right now.

Nico
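One item on that wish list, normalization-insensitive but normalization-preserving comparison, can be approximated today with an application-defined collation. This is only a sketch under assumptions: it uses Python's standard sqlite3 and unicodedata modules, and the collation name NFC_CMP and the table are made up for illustration.

```python
import sqlite3
import unicodedata

def nfc_compare(a: str, b: str) -> int:
    """Compare two strings by code point after normalizing both to NFC."""
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)

conn = sqlite3.connect(":memory:")
conn.create_collation("NFC_CMP", nfc_compare)   # hypothetical collation name
conn.execute("CREATE TABLE words (w TEXT COLLATE NFC_CMP)")

composed = "caf\u00e9"       # é as one code point (U+00E9)
decomposed = "cafe\u0301"    # e followed by combining acute accent (U+0301)
conn.execute("INSERT INTO words VALUES (?)", (decomposed,))

# The lookup matches although the query uses the other normalization form.
(stored,) = conn.execute("SELECT w FROM words WHERE w = ?", (composed,)).fetchone()
# length() counts code points; the stored value was not normalized, so it has 5.
(stored_len,) = conn.execute("SELECT length(w) FROM words").fetchone()
```

The comparison ignores the normalization form, while the column still returns the string exactly as it was inserted. Language-sensitive ordering is a separate matter and is not attempted here.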
Re: [sqlite] Unicode support
> but if ORDER BY is relying on an index for ordering, then flip() can have
> negative effects.

Substr() could have negative effects on ordering too. That is a red herring. Flip() is merely a function that reverses the order of codepoints "as found" without knowing anything about what those codepoints, individually or in combination, might signify in a writing system. If I want to write those codepoints to a column that's my concern.

What if I wanted to have a column that consisted of codepoints from all over the Unicode range: a codepoint from Greek next to a codepoint from Swahili next to a codepoint from Hungarian? Shouldn't I be able to say to a database: this column contains codepoints (characters) and collation is not relevant, sort the column using the numeric value of the codepoints?

Tim Romano

Nicolas Williams wrote:
> On Tue, Nov 17, 2009 at 05:15:16PM -0500, Igor Tandetnik wrote:
>> Nicolas Williams wrote:
>>> This is no longer true, either of 'ch' nor 'll'.
>>
>> There is a number of contractions in Hungarian that are still very
>> much in use, but I can't recall them off the top of my head the way I
>> can 'ch' (it's something like 'dzs'). There are also contractions in
>> German Phonebook sort (e.g. 'oe' should sort between 'o with umlaut'
>> and 'p', if I recall correctly). There are likely other cases.
>
> I'm not surprised :(
>
>>> The principle you
>>> state is correct, of course, but really, this is a collation problem,
>>> and affects SQLite3 apps regardless of "flip()".
>>
>> My point is, it's difficult to even define what the correct behavior
>> of flip() should be, let alone implement one. And so the safest course
>> of action is to leave it out of core SQLite: a developer in need of
>> such a function would presumably know the nature of their data and
>> precisely what they want the function to achieve, and can always
>> implement it as a custom function.
>
> Maybe. For indexing, I don't see the harm as long as an index built
> with this function isn't used for ORDER BY when you care about
> collations (ah! SQLite3 couldn't tell this is happening without knowing
> the semantics of the function).
>
>>> The collation is
>>> per-column, and the run-time should make functions aware of the
>>> collation (if any) of a column when an argument.
>>
>> What about
>>
>> select flip(EnglishText || GermanText || SpanishText)
>> from MyMultilingualTable;
>
> No different than:
>
> select EnglishText || GermanText || SpanishText from MyMultilingualTable;
>
> the concatenation can create 'oe' and all those other whatever they are
> called's.
>
> This is OK until you ORDER BY, and _then_ the collation requested or
> inferred needs to apply. Ah, there should be no inference of collation
> from function names, and functions shouldn't have to care about
> collations "in effect" -- only ORDER BY should care, but if ORDER BY is
> relying on an index for ordering, then flip() can have negative effects.
>
> Nico
Re: [sqlite] Unicode support
Tim,

>For those who are insisting on Unicode graphemic codepoint-combination
>intelligence: why can't we have a function that simply reverses the
>order of the codepoints, and is blissfully ignorant about what those
>individual codepoints or codepoint-combinations might signify as
>graphemes in a writing system? The flip() function could be totally
>naive about all that and be 100% deterministic. All I want is a way to
>get the monadic codepoints of a text-affinity column in reverse order.

I just wrote one for you, can you check your inbox?
Re: [sqlite] Unicode support
For those who are insisting on Unicode graphemic codepoint-combination intelligence: why can't we have a function that simply reverses the order of the codepoints, and is blissfully ignorant about what those individual codepoints or codepoint-combinations might signify as graphemes in a writing system? The flip() function could be totally naive about all that and be 100% deterministic. All I want is a way to get the monadic codepoints of a text-affinity column in reverse order.
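A naive flip() of this kind can be registered as an application-defined function without touching the SQLite core. The sketch below assumes Python's standard sqlite3 module, where a str is a sequence of code points, so slicing reverses code points and nothing more; it is not the function that was sent off-list.

```python
import sqlite3

def flip(text):
    """Reverse the code points of a string; NULL stays NULL."""
    return None if text is None else text[::-1]

conn = sqlite3.connect(":memory:")
conn.create_function("flip", 1, flip)

(plain,) = conn.execute("SELECT flip('abc')").fetchone()
# "A", combining acute accent, "E": before flipping, the accent belongs to the A.
(accented,) = conn.execute("SELECT flip(?)", ("A\u0301E",)).fetchone()
```

Here plain is 'cba', and accented comes back as "E", combining accent, "A", so the accent now follows the E. That is the trade-off raised elsewhere in this thread: the function is deterministic and simple, but it knows nothing about combining sequences.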
Re: [sqlite] Unicode support
On Tue, Nov 17, 2009 at 05:15:16PM -0500, Igor Tandetnik wrote: > Nicolas Williams wrote: > > This is no longer true, either of 'ch' nor 'll'. > > There is a number of contractions in Hungarian that are still very > much in use, but I can't recall them off the top of my head the way I > can 'ch' (it's something like 'dzs'). There are also contractions in > German Phonebook sort (e.g. 'oe' should sort between 'o with umlaut' > and 'p', if I recall correctly). There are likely other cases. I'm not surprised :( > > The principle you > > state is correct, of course, but really, this is a collation problem, > > and affects SQLite3 apps regardless of "flip()". > > My point is, it's difficult to even define what the correct behavior > of flip() should be, let alone implement one. And so the safest course > of action is to leave it out of core SQLite: a developer in need of > such a function would presumably know the nature of their data and > precisely what they want the function to achieve, and can always > implement it as a custom function. Maybe. For indexing, I don't see the harm as long as an index built with this function isn't used for ORDER BY when you care about collations (ah! SQLite3 couldn't tell this is happening without knowing the semantics of the function). > > The collation is > > per-column, and the run-time should make functions aware of the > > collation (if any) of a column when an argument. > > What about > > select flip(EnglishText || GermanText || SpanishText) > from MyMultilingualTable; No different than: select EnglishText || GermanText || SpanishText from MyMultilingualTable; the concatenation can create 'oe' and all those other whatever they are called's. This is OK until you ORDER BY, and _then_ the collation requested or inferred needs to apply. 
Ah, there should be no inference of collation from function names, and functions shouldn't have to care about collations "in effect" -- only ORDER BY should care, but if ORDER BY is relying on an index for ordering, then flip() can have negative effects. Nico -- ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unicode support
On 17 Nov 2009, at 10:05pm, Beau Wilkinson wrote:
> I think a better approach (to the design of Unicode) would have been for
> Spanish and German (for instance) to share absolutely nothing in the encoding
> standards. Each language ought to have its own little span of letters,
> immortalized into the standard in correct order-of-collation, with no sharing
> of "code points," "characters," or anything else.

This is how at least two Unicode libraries I know of work internally. For all pieces of text they encounter they infer which language(s) this text represents. They then use whatever sort order is appropriate to that language. This requires you to assign language(s) to a string as the string is typed in, so that moving a database from one country to another does not change the collation order. If each piece of text is in the same language this does not require any space (well, just one 'default language' stored with the entire database file) but sometimes one string includes text from more than one country, e.g. switching from Roman to Japanese and back again.

The only advantage to this system is that it works, and it works consistently.

Simon.
Re: [sqlite] Unicode support
A few minutes ago I wrote that:

>I think that as a general rule, the "combining" accents should be disregarded
>during collation.
>
> etc.

I just read that "collation" page from Unicode.org and it seems to be completely at odds with what I suggested, e.g. in its insistence that some sequences of code points are "canonically equivalent."

In light of this fact, I do not see how Unicode can ever really be considered "collated." And it follows that it cannot be reversed. At least, this is the case if one follows the advice at Unicode.org.

The "collation" that Unicode.org seems to suggest is basically the invention of some academics. It does not seem to correspond to any human alphabet. Please, please correct me if I am wrong on this. I have never been one of those to just ignore Unicode. But I am starting to see that it does not really work so well in the real world once one leaves the realm of "ASCII-with-zeroes-on-top."

From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] On Behalf Of Igor Tandetnik [itandet...@mvps.org]
Sent: Tuesday, November 17, 2009 1:01 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support

Simon Slavin wrote:
> On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:
>
>> Simon Slavin wrote:
>>> First split the string into characters, then reassemble them in
>>> reverse order.
>>
>> The problem is, in Unicode it's not quite clear what constitutes a
>> "character". Are we talking about codepoints, sort elements,
>> graphemes? Depending on the application, either definition might
>> make sense.
>
> I agree about the problem, but sort elements is the obvious answer in
> this case.

This would mean that the result of the hypothetical flip() function would be locale-dependent. E.g. in Spanish Traditional sort, a combination 'ch' sorts as if it were a single letter between 'c' and 'd', forming a single sort element (a so-called contraction). So should 'a ch b' reverse to 'b ch a' under Spanish Traditional sort, and to 'b hc a' otherwise? Would you pass a desired locale as a parameter to flip(), in order to achieve that?

Igor Tandetnik
Re: [sqlite] Unicode support
Nicolas Williams wrote: > On Tue, Nov 17, 2009 at 02:01:55PM -0500, Igor Tandetnik wrote: >> This would mean that the result of the hypothetical flip() function >> would be locale-dependent. E.g. in Spanish Traditional sort, a >> combination 'ch' sorts as if it were a single letter between 'c' and >> 'd', forming a single sort element (a so-called contraction). So >> should 'a ch b' reverse to 'b ch a' under Spanish Traditional sort, >> and to 'b hc a' otherwise? Would you pass a desired locale as a >> parameter to flip(), in order to achieve that? > > This is no longer true, either of 'ch' nor 'll'. There is a number of contractions in Hungarian that are still very much in use, but I can't recall them off the top of my head the way I can 'ch' (it's something like 'dzs'). There are also contractions in German Phonebook sort (e.g. 'oe' should sort between 'o with umlaut' and 'p', if I recall correctly). There are likely other cases. > The principle you > state is correct, of course, but really, this is a collation problem, > and affects SQLite3 apps regardless of "flip()". My point is, it's difficult to even define what the correct behavior of flip() should be, let alone implement one. And so the safest course of action is to leave it out of core SQLite: a developer in need of such a function would presumably know the nature of their data and precisely what they want the function to achieve, and can always implement it as a custom function. > The collation is > per-column, and the run-time should make functions aware of the > collation (if any) of a column when an argument. What about select flip(EnglishText || GermanText || SpanishText) from MyMultilingualTable; Igor Tandetnik ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Unicode support
>> On 17 Nov 2009, at 5:52pm, Igor Tandetnik wrote:
>>
>>> But for your goals, it has to be sortable, right? In a proper
>>> Unicode collation, U+0041 U+0301 would behave quite differently from
>>> U+0301 U+0041. Consider "A ' E" (where ' stands for a combining
>>> acute accent). In most locales, this would sort between AE and BE.
>>> Now, if we reverse it naively, we'll end up with "E ' A", with the
>>> accent now attached to E and not A. The result would sort between EA
>>> and FA, rather than between EA and EB as you would probably want.

I think that as a general rule, the "combining" accents should be disregarded during collation.

For example: if a string contains the letter "a" plus a "combining acute accent," to me that seems like a hint that what we have is basically a letter "a," not a distinct letter with its own place in the collation sequence. This should be collated as an "a" that just happens to be accented, for whatever reason.

In Spanish, for example, a diaeresis is sometimes placed over the letter "U." This indicates that the preceding consonant is hard. It does not make the "U" into a different letter, or significantly affect the collation sequence. (At most, it is a tie-breaker between two otherwise identical words.) So, I think the Spanish diaeresis thus represents a legitimate use of the Unicode "combining diaeresis." In fact, I would submit that encoding Spanish's "U with diaeresis" using code point U+00FC is just wrong, in the same way as coding letter "O" as ASCII 0x30 (zero) is wrong. We do not need to worry about cleaning up such a mistake in our collation code.

In German, and the Scandinavian languages, the opposite is true. Putting a diaeresis over a letter makes a new letter, which collates differently. "Combining accents" code points are not appropriate in these languages and their use should not be supported by a collation algorithm. Rather, these letters should be encoded using single code points.

I think a better approach (to the design of Unicode) would have been for Spanish and German (for instance) to share absolutely nothing in the encoding standards. Each language ought to have its own little span of letters, immortalized into the standard in correct order-of-collation, with no sharing of "code points," "characters," or anything else. Unicode screws this up, as it does with so many things, and this is a big reason why it's widely reviled (or, ignored) by many programmers.

This is editorial commentary, but I do not necessarily think it is irrelevant. I get the feeling that something better than Unicode must be brewing somewhere. Of course, sometimes bad standards have a life of their own, because they give us license to refuse to implement things and still look smart in so refusing. I suggest that this is a very detrimental pattern, though.

________
From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] On Behalf Of Igor Tandetnik [itandet...@mvps.org]
Sent: Tuesday, November 17, 2009 1:01 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support

Simon Slavin wrote:
> On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:
>
>> Simon Slavin wrote:
>>> First split the string into characters, then reassemble them in
>>> reverse order.
>>
>> The problem is, in Unicode it's not quite clear what constitutes a
>> "character". Are we talking about codepoints, sort elements,
>> graphemes? Depending on the application, either definition might
>> make sense.
>
> I agree about the problem, but sort elements is the obvious answer in
> this case.

This would mean that the result of the hypothetical flip() function would be locale-dependent. E.g. in Spanish Traditional sort, a combination 'ch' sorts as if it were a single letter between 'c' and 'd', forming a single sort element (a so-called contraction). So should 'a ch b' reverse to 'b ch a' under Spanish Traditional sort, and to 'b hc a' otherwise? Would you pass a desired locale as a parameter to flip(), in order to achieve that?

Igor Tandetnik
Re: [sqlite] Unicode support
On Tue, Nov 17, 2009 at 02:01:55PM -0500, Igor Tandetnik wrote:
> This would mean that the result of the hypothetical flip() function
> would be locale-dependent. E.g. in Spanish Traditional sort, a
> combination 'ch' sorts as if it were a single letter between 'c' and
> 'd', forming a single sort element (a so-called contraction). So
> should 'a ch b' reverse to 'b ch a' under Spanish Traditional sort,
> and to 'b hc a' otherwise? Would you pass a desired locale as a
> parameter to flip(), in order to achieve that?

This is no longer true, either of 'ch' nor 'll'. The principle you state is correct, of course, but really, this is a collation problem, and affects SQLite3 apps regardless of "flip()". The collation is per-column, and the run-time should make functions aware of the collation (if any) of a column when an argument.

Nico
Re: [sqlite] Unicode support
Simon Slavin wrote:
> On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:
>
>> Simon Slavin wrote:
>>> First split the string into characters, then reassemble them in
>>> reverse order.
>>
>> The problem is, in Unicode it's not quite clear what constitutes a
>> "character". Are we talking about codepoints, sort elements,
>> graphemes? Depending on the application, either definition might
>> make sense.
>
> I agree about the problem, but sort elements is the obvious answer in
> this case.

This would mean that the result of the hypothetical flip() function would be locale-dependent. E.g. in Spanish Traditional sort, a combination 'ch' sorts as if it were a single letter between 'c' and 'd', forming a single sort element (a so-called contraction). So should 'a ch b' reverse to 'b ch a' under Spanish Traditional sort, and to 'b hc a' otherwise? Would you pass a desired locale as a parameter to flip(), in order to achieve that?

Igor Tandetnik
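To make the question concrete, here is a small sketch of a flip that reverses by sort element rather than by code point. It is purely illustrative: flip_elements and its contractions parameter are hypothetical, and the contraction list is supplied by the caller rather than taken from any real locale data.

```python
def flip_elements(text: str, contractions: tuple = ()) -> str:
    """Reverse text element by element, keeping listed contractions intact."""
    # Try longer contractions first so that e.g. 'dzs' wins over 'dz'.
    ordered = sorted((c for c in contractions if c), key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for c in ordered:
            if text.startswith(c, i):
                elements.append(c)      # keep the contraction as one unit
                i += len(c)
                break
        else:
            elements.append(text[i])    # ordinary single code point
            i += 1
    return "".join(reversed(elements))

traditional = flip_elements("a ch b", contractions=("ch",))
default = flip_elements("a ch b")
print(traditional, "|", default)   # prints: b ch a | b hc a
```

The same input gives two different results depending on the contraction list, which is the locale dependence being pointed out: the caller has to say which collation's elements are meant.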
RE: [sqlite] Unicode support for Sqlite?
Thank you all for the quick replies.

Best Regards,
A.Sreedhar.

-----Original Message-----
From: Trevor Talbot [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 12, 2007 5:08 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support for Sqlite?

On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:
> I am using the sqlite to store the metadata of audio files.
> Is it possible to store the metadata in unicode character format in sqlite.

Yes; SQLite assumes all TEXT type data in the database is Unicode. You can work with it in UTF-8 with the *_text() APIs, or UTF-16 using the *_text16() calls. SQLite will convert between the two encodings as necessary.

The sqlite3 shell assumes UTF-8, but it depends on the platform's console to actually use UTF-8 when talking to it, so it may be difficult to properly test with it.

- To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode support for Sqlite?
On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:
> I am using the sqlite to store the metadata of audio files.
> Is it possible to store the metadata in unicode character format in sqlite.

Yes; SQLite assumes all TEXT type data in the database is Unicode. You can work with it in UTF-8 with the *_text() APIs, or UTF-16 using the *_text16() calls. SQLite will convert between the two encodings as necessary.

The sqlite3 shell assumes UTF-8, but it depends on the platform's console to actually use UTF-8 when talking to it, so it may be difficult to properly test with it.
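A quick way to see the conversion at work without the console getting in the way is to go through a language binding. The sketch below assumes Python's standard sqlite3 module, which as far as I know exchanges text with SQLite as UTF-8; the table and the sample title are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Has to run before the first table is created; it selects the stored encoding.
conn.execute("PRAGMA encoding = 'UTF-16le'")
conn.execute("CREATE TABLE tracks (title TEXT)")

title = "Bj\u00f6rk \u2013 J\u00f3ga"   # non-ASCII metadata, as for an audio file
conn.execute("INSERT INTO tracks VALUES (?)", (title,))

(encoding,) = conn.execute("PRAGMA encoding").fetchone()
(stored, n_chars) = conn.execute(
    "SELECT title, length(title) FROM tracks"
).fetchone()
```

The text comes back unchanged and length() agrees with the number of code points, even though the application and the database file use different encodings.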
Re: [sqlite] Unicode support for Sqlite?
UTF-8 and UTF-16 ARE Unicode formats. But there are some things that SQLite does not handle without the ICU extension.

The ICU extension extends SQLite with the following functionality:

1.1 SQL Scalars upper() and lower()
1.2 Unicode Aware LIKE Operator
1.3 ICU Collation Sequences
1.4 SQL REGEXP Operator

Download the SQLite source and have a look in the ext/icu directory.

Sreedhar.a wrote:
> Hi,
>
> Does Sqlite support unicode?
> I have seen that it supports utf-8 and utf-16.
> I want to know whether it supports unicode character formats.
>
> Thanks and Best Regards,
> A.Sreedhar.
RE: [sqlite] Unicode support for Sqlite?
Hi,

I am using SQLite to store the metadata of audio files. Is it possible to store the metadata in Unicode character format in SQLite?

Best Regards,
A.Sreedhar.

-----Original Message-----
From: Trevor Talbot [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 12, 2007 4:40 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support for Sqlite?

On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:
> Does Sqlite support unicode?
> I have seen that it supports utf-8 and utf-16.
> I want to know whether it supports unicode character formats.

Unicode is a very large and complex topic, so that question is way too vague to answer. Can you provide an example of what you're looking for?
Re: [sqlite] Unicode support for Sqlite?
On 12/12/07, Sreedhar.a <[EMAIL PROTECTED]> wrote:
> Does Sqlite support unicode?
> I have seen that it supports utf-8 and utf-16.
> I want to know whether it supports unicode character formats.

Unicode is a very large and complex topic, so that question is way too vague to answer. Can you provide an example of what you're looking for?
Re: [sqlite] UNICODE Support
On 04.08.2006 at 19:23, Cory Nelson wrote:
> I was not talking about sorting in my post - I've had simple = index
> comparisons fail in UTF-8.

I'm pretty sure you can get the same kind of 'failure' when using UTF-16, e.g. when comparing decomposed against composed forms of Unicode strings. Since SQLite only really does a 'binary' comparison, this may easily fail for non-ASCII strings.

Also, there's a prominent warning in the documentation about working with case-insensitive comparison (since it only does it right for ASCII characters). Maybe this is where some more complete Unicode support is most sorely missing, but it's probably beyond SQLite's scope to do proper Unicode-savvy case shifting...?
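The composed-versus-decomposed case is easy to reproduce. A minimal sketch, assuming Python's standard sqlite3 and unicodedata modules; the helper sql_equal is made up for illustration.

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")

composed = "\u00fc"      # ü as a single code point
decomposed = "u\u0308"   # u followed by combining diaeresis

def sql_equal(a: str, b: str) -> bool:
    """Compare two strings with SQLite's default (binary) comparison."""
    (result,) = conn.execute("SELECT ? = ?", (a, b)).fetchone()
    return bool(result)

raw_equal = sql_equal(composed, decomposed)
normalized_equal = sql_equal(
    unicodedata.normalize("NFC", composed),
    unicodedata.normalize("NFC", decomposed),
)
print(raw_equal, normalized_equal)   # prints: False True
```

Both strings display identically, yet they only compare equal once the application normalizes them to one form before handing them to SQLite.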
Re: [sqlite] UNICODE Support
On Fri, Aug 04, 2006 at 10:02:58PM -0700, Cory Nelson wrote: > On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote: > >On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote: > > > >> But, since you brought it up - I have no expectations of SQLite > >> integrating a full Unicode locale library, however it would be a great > >> improvement if it would respect the current locale and use wcs* > >> functions when available, or at least order by standard Unicode order > >> instead of completely mangling things on UTF-8 codes. > > > >What do you mean by "standard Unicode order" in this context? > > > > Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely > correct) while sorting, to at least make them follow the same pattern. Huh? UTF-8 handled in the naive way (using "memcmp", like sqlite does) will automagically give you sorting by unicode codepoint (probably the only useful meaning of "standard Unicode order" here). UTF-16 handled in the naive way (either using "memcmp" or lexicographically on 2-byte integers) will sort things by codepoint, mostly, sort of, and otherwise by a weird order that falls out of details of the UTF-16 standard accidentally.[1] Perhaps you're using a legacy system that standardized on UTF-16 before the BMP ran out, and want to be compatible with its idiosyncratic sorting -- then converting things to UTF-16 before comparing makes sense. But that's not really appropriate to make as a general recommendation... better to convert UTF-16 to UTF-8, if you want to be entirely correct :-). [1] see e.g. http://icu.sourceforge.net/docs/papers/utf16_code_point_order.html -- Nathaniel -- Details are all that matters; God dwells there, and you never get to see Him if you don't struggle to get them right. -- Stephen Jay Gould
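The ordering difference is easy to check with two code points, one in the BMP above the surrogate range and one outside the BMP. The sketch below is plain Python; it compares big-endian UTF-16 bytes, which matches the "lexicographically on 2-byte integers" case described above.

```python
a = "\ue000"       # U+E000: in the BMP, above the surrogate range
b = "\U00010000"   # U+10000: first code point outside the BMP

# Python compares str values by code point.
by_codepoint = sorted([b, a])

# Byte-wise comparison of the encoded forms, as memcmp would do.
by_utf8 = sorted([b, a], key=lambda s: s.encode("utf-8"))
by_utf16 = sorted([b, a], key=lambda s: s.encode("utf-16-be"))

print(by_utf8 == by_codepoint)   # prints "True"
print(by_utf16 == by_codepoint)  # prints "False"
```

U+10000 is encoded in UTF-16 as the surrogate pair D800 DC00, which sorts before E000, so the naive UTF-16 order puts it ahead of U+E000.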
Re: [sqlite] UNICODE Support
On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> > On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> > > But, since you brought it up - I have no expectations of SQLite integrating a full Unicode locale library, however it would be a great improvement if it would respect the current locale and use wcs* functions when available, or at least order by standard Unicode order instead of completely mangling things on UTF-8 codes.
> > What do you mean by "standard Unicode order" in this context?
> Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely correct) while sorting, to at least make them follow the same pattern.

Ah, so Unicode codepoint order. Unfortunately this isn't accurate: UTF-8 and UTF-32/UCS-4 are both naturally in codepoint order (UTF-8 because of the MSB-first style format), but UTF-16 isn't due to the way surrogate pairs are constructed. UTF-16 is actually the oddball here :P
Re: [sqlite] UNICODE Support
On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> > But, since you brought it up - I have no expectations of SQLite integrating a full Unicode locale library, however it would be a great improvement if it would respect the current locale and use wcs* functions when available, or at least order by standard Unicode order instead of completely mangling things on UTF-8 codes.
> What do you mean by "standard Unicode order" in this context?

Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely correct) while sorting, to at least make them follow the same pattern.

--
Cory Nelson
http://www.int64.org
Re: [sqlite] UNICODE Support
On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> But, since you brought it up - I have no expectations of SQLite integrating a full Unicode locale library, however it would be a great improvement if it would respect the current locale and use wcs* functions when available, or at least order by standard Unicode order instead of completely mangling things on UTF-8 codes.

What do you mean by "standard Unicode order" in this context?
Re: [sqlite] UNICODE Support
On 8/5/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> On 8/4/06, Nuno Lucas <[EMAIL PROTECTED]> wrote:
> > On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> > > IE, using memcmp() to compare strings. I've been bitten by this before, with SQLite producing unexpected results when using UTF-8. Using UTF-16 has worked more reliably in my experience.
> > SQLite only knows how to sort ASCII, so memcmp does that right (be it UTF-8 or UTF-16).
> > If you think about it, the only way sorting will work 100% is by having some form of localization (because for each language different sorting rules apply, _even_ for words composed only of ASCII characters).
> > Adding localization to SQLite is out of the question (it would probably need a library as big as SQLite itself), so it's up to the user to define its own localization functions and integrate them with sqlite (there are all the necessary hooks ready for that).
> I was not talking about sorting in my post - I've had simple = index comparisons fail in UTF-8.

You should have reported it. If it's true, it's a bug that needs to be corrected. But again, I would say I have never found a bug like that in sqlite.

> But, since you brought it up - I have no expectations of SQLite integrating a full Unicode locale library, however it would be a great improvement if it would respect the current locale and use wcs* functions when available, or at least order by standard Unicode order instead of completely mangling things on UTF-8 codes.

If it respected the current locale, the database would become invalid when moved to or used in another locale (the affected indexes would need to be rebuilt). Using the COLLATE feature (which I never used, exactly because of the problem above) you can define your own sort function that does what you want.

On the second point, you may be right, and it can be considered a bug. A sorted table should have exactly the same order whether the database is using UTF-8 or UTF-16 internally (even if it doesn't follow the UNICODE order). At the very least, query results should be consistent in this respect.

Maybe others have another point of view...

Regards,
~Nuno Lucas
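As an illustration of the COLLATE route, here is a sketch using Python's sqlite3 module, where create_collation registers a comparison callback. The collation name PY_NOCASE and the casefold-based comparison are assumptions made for this example; a real application would more likely plug in ICU or another locale-aware comparison.

```python
import sqlite3


def unicode_nocase(a: str, b: str) -> int:
    """Compare two strings case-insensitively using Unicode case folding."""
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
# PY_NOCASE is a hypothetical collation name chosen for this sketch.
conn.create_collation("PY_NOCASE", unicode_nocase)

conn.execute("CREATE TABLE names (n TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?)",
    [("Banana",), ("apple",), ("\u00c9mile",), ("\u00e9mile",)],
)

# Equality under the custom collation matches both spellings of the name.
matches = conn.execute(
    "SELECT count(*) FROM names WHERE n = ? COLLATE PY_NOCASE",
    ("\u00c9MILE",),
).fetchone()[0]

# Sorting under the custom collation ignores case.
ordered = [
    row[0]
    for row in conn.execute("SELECT n FROM names ORDER BY n COLLATE PY_NOCASE")
]
print(matches, ordered[:2])
```

The caveat raised above still applies: an index built with such a collation is only valid as long as every program opening the database registers a collation with the same name and the same behavior.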
Re: [sqlite] UNICODE Support
On 8/4/06, Nuno Lucas <[EMAIL PROTECTED]> wrote:
> On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> > IE, using memcmp() to compare strings. I've been bitten by this before, with SQLite producing unexpected results when using UTF-8. Using UTF-16 has worked more reliably in my experience.
> SQLite only knows how to sort ASCII, so memcmp does that right (be it UTF-8 or UTF-16).
> If you think about it, the only way sorting will work 100% is by having some form of localization (because for each language different sorting rules apply, _even_ for words composed only of ASCII characters).
> Adding localization to SQLite is out of the question (it would probably need a library as big as SQLite itself), so it's up to the user to define its own localization functions and integrate them with sqlite (there are all the necessary hooks ready for that).

I was not talking about sorting in my post - I've had simple = index comparisons fail in UTF-8.

But, since you brought it up - I have no expectations of SQLite integrating a full Unicode locale library, however it would be a great improvement if it would respect the current locale and use wcs* functions when available, or at least order by standard Unicode order instead of completely mangling things on UTF-8 codes.

> Regards,
> ~Nuno Lucas

--
Cory Nelson
http://www.int64.org
Re: [sqlite] UNICODE Support
On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> IE, using memcmp() to compare strings. I've been bitten by this before, with SQLite producing unexpected results when using UTF-8. Using UTF-16 has worked more reliably in my experience.

SQLite only knows how to sort ASCII, so memcmp does that right (be it UTF-8 or UTF-16).

If you think about it, the only way sorting will work 100% is by having some form of localization (because for each language different sorting rules apply, _even_ for words composed only of ASCII characters).

Adding localization to SQLite is out of the question (it would probably need a library as big as SQLite itself), so it's up to the user to define its own localization functions and integrate them with sqlite (there are all the necessary hooks ready for that).

Regards,
~Nuno Lucas
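One of the hooks in question is the ability to register application-defined SQL functions. The sketch below uses create_function from Python's sqlite3 module; the SQL function name py_upper is hypothetical, and Python's str.upper stands in for whatever localization routine the application prefers.

```python
import sqlite3


def py_upper(value):
    """Uppercase text with Python's Unicode rules; pass other types through."""
    return value.upper() if isinstance(value, str) else value


conn = sqlite3.connect(":memory:")
# Register the Python callable as a one-argument SQL function.
conn.create_function("py_upper", 1, py_upper)

word = "\u00e9t\u00e9"  # French "été", with precomposed accents
builtin, custom = conn.execute(
    "SELECT upper(?), py_upper(?)", (word, word)
).fetchone()

# The built-in result depends on whether the SQLite build includes ICU.
print(repr(builtin))
print(repr(custom))
```

The application-defined function returns the fully uppercased word regardless of how SQLite was built, at the cost of a callback into Python for every row.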
Re: [sqlite] UNICODE Support
On 8/4/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> "Cory Nelson" <[EMAIL PROTECTED]> wrote:
> > On 8/3/06, RohitPatel <[EMAIL PROTECTED]> wrote:
> >
> > I recommend using utf-16 in the database - sqlite doesn't fully support utf-8, and some things may give unexpected results if you use it.
>
> Oh really? What exactly is missing from SQLite's UTF-8 support?

Correct me if I'm wrong, but from what I understand SQLite supports storing and converting between UTF-8 and UTF-16, but that is where the support stops. It is wrong (in my opinion) to claim UTF-8 support, at least without a clear upfront warning, when that's all it offers.

IE, using memcmp() to compare strings. I've been bitten by this before, with SQLite producing unexpected results when using UTF-8. Using UTF-16 has worked more reliably in my experience.

> --
> D. Richard Hipp <[EMAIL PROTECTED]>

--
Cory Nelson
http://www.int64.org
Re: [sqlite] UNICODE Support
"Cory Nelson" <[EMAIL PROTECTED]> wrote: > On 8/3/06, RohitPatel <[EMAIL PROTECTED]> wrote: > > I recommend using utf-16 in the database - sqlite doesn't fully > support utf-8, and some things may give unexpected results if you use > it. > Oh really? What exactly is missing from SQLite's UTF-8 support? -- D. Richard Hipp <[EMAIL PROTECTED]>
RE: [sqlite] UNICODE Support
You can convert your text using the A2W() and W2A() functions (or others) before passing it to SQLite and after retrieving it back from SQLite. That's what we do (it's a Japanese application).

Dennis

-Original Message-
From: Ajay [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 09, 2005 12:12 AM
To: sqlite-users@sqlite.org
Subject: RE: [sqlite] UNICODE Support

But what about the SQLite functions whose parameters have the data type LPSTR? Can you give me the details on how to support wide characters?

Regards,
Ajay Sonawane

-Original Message-
From: Martin Engelschalk [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 08, 2005 6:48 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] UNICODE Support

Hi,

See http://www.sqlite.org/pragma.html, search for 'PRAGMA encoding'

/Martin

Ajay wrote:
> Hello there,
> Does SQLite support UNICODE? Can I store some Arabic or Chinese text in the database?
> If it does not support UNICODE, is there any workaround for that?
> Regards,
> Ajay Sonawane
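A2W() and W2A() are ATL conversion macros for C++ on Windows, so they do not carry over directly to other languages. The same convert-on-the-way-in, convert-on-the-way-out pattern can be sketched in Python as below; the cp932 code page and the sample text are assumptions standing in for whatever legacy encoding an application actually uses.

```python
import sqlite3

# Stand-in for text that arrives from a legacy API in a Japanese code page.
legacy_bytes = "日本語のテキスト".encode("cp932")

# Convert to Unicode before handing the text to SQLite.
text = legacy_bytes.decode("cp932")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")  # hypothetical table
conn.execute("INSERT INTO notes VALUES (?)", (text,))

# Convert back to the legacy encoding after reading the text out again.
(stored,) = conn.execute("SELECT body FROM notes").fetchone()
round_tripped = stored.encode("cp932")

print(round_tripped == legacy_bytes)  # prints "True"
```

The database itself only ever sees Unicode text; the legacy code page exists solely at the boundary with the rest of the application.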
RE: [sqlite] UNICODE Support
But what about the SQLite functions whose parameters have the data type LPSTR? Can you give me the details on how to support wide characters?

Regards,
Ajay Sonawane

-Original Message-
From: Martin Engelschalk [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 08, 2005 6:48 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] UNICODE Support

Hi,

See http://www.sqlite.org/pragma.html, search for 'PRAGMA encoding'

/Martin

Ajay wrote:
> Hello there,
> Does SQLite support UNICODE? Can I store some Arabic or Chinese text in the database?
> If it does not support UNICODE, is there any workaround for that?
> Regards,
> Ajay Sonawane
Re: [sqlite] UNICODE Support
Hi,

See http://www.sqlite.org/pragma.html, search for 'PRAGMA encoding'

/Martin

Ajay wrote:
> Hello there,
> Does SQLite support UNICODE? Can I store some Arabic or Chinese text in the database?
> If it does not support UNICODE, is there any workaround for that?
> Regards,
> Ajay Sonawane
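For reference, a minimal sketch of PRAGMA encoding driven from Python's sqlite3 module follows. The table and the sample text are made up; the pragma is issued before any table is created because it has no effect on a database that already exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Must run before the first table is created.
conn.execute("PRAGMA encoding = 'UTF-16le'")

conn.execute("CREATE TABLE greetings (g TEXT)")  # hypothetical table
sample = "مرحبا 你好"
conn.execute("INSERT INTO greetings VALUES (?)", (sample,))

# Ask SQLite which encoding the database actually uses.
encoding = conn.execute("PRAGMA encoding").fetchone()[0]
value = conn.execute("SELECT g FROM greetings").fetchone()[0]

print(encoding)
print(value == sample)
```

Whichever encoding is in effect, the Arabic and Chinese text reads back unchanged; the pragma only controls how the text is laid out inside the database file.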