Re: [sqlite] Unicode collation
On 6/28/07, Nuno Lucas <[EMAIL PROTECTED]> wrote: It seems more natural to use the standard C locale names to me (the usual "pt_PT" and "pt_BR" for Portuguese/Portugal culture and Portuguese/Brazil culture) , but I'm open to suggestions when that problem arise, and I'm sure there are already standards we can follow in relation to that. RFCs 4646 and 4647 cover this convention. They get rather verbose about it, but it boils down to using ISO-defined language and country codes in a particular format. Most internet protocols (http etc) use the same convention, so we'd be in good company. - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/28/07, Trevor Talbot <[EMAIL PROTECTED]> wrote: On 6/28/07, Nuno Lucas <[EMAIL PROTECTED]> wrote: > One thing I noticed is that "collations" != "case change". This will > not make it possible to use UPPER/LOWER with the same data on the > table, but maybe we can work on something in that respect, also. If the intent is case-insensitive comparisons a la LIKE, note that "upper" and "lower" operations alone won't suffice, because in some scripts a single character maps to multiple characters in one direction. Unicode defines a "case folding" operation for such uses, which does full expansion no matter which direction it is (lower->upper or upper->lower), allowing the result to be handled with a simple memcmp()-style binary comparison. Right, an example is the german "straße" -> "STRASSE" (the 'ß' - beta - character being replaced by "SS" in uppercase, for those who can't see the mail properly). It should certainly be possible to cover those operations too, it's just as nuanced as the collation work :) Yes. I'll try to think on this too, but not the big priority. Best regards, ~Nuno Lucas - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/28/07, Jiri Hajek <[EMAIL PROTECTED]> wrote: > My idea is to implement the UCA collation in SQLite (with the usual > OMIT_* #ifdef's), using the DUCET table as base, and if people need > the tailoring part for localized sorting, have it be optional by > having a "sqlite_collation_data" table with the needed locale data > included on the database. That would certainly be great if this is implemented. Note, however, that it doesn't fully solve the issues described in this thread - i.e. if you create a DB by some SQLite version and then use it by a newer version, where some elements were added/modified in DUCET, indexes of these DBs wouldn't be compatible. It can be resolved in several ways, e.g. as suggested to have all DUCET data stored in a special table in SQLite database. It's just a matter of choosing a well-balanced solution... There are 2 problems: UCA changes and DUCET (and/or other locale data) changes. DUCET and locale data is in the database table, so it can only change by user intervention, meaning it's his fault if done without rebuilding the affected table index(es). UCA changes are more problematic, but they are less frequent (it seems there was a minor change between Unicode 4.0 and 4.1, though). I don't see any good solution for this other than having an extra field on the database file (or in the collation data tables) with the UCA version and advise the user when using an UCA version different from the one it was created (or last used, as by default no collation data is needed). It's very probable that not many users are actually bothered by this (the algorithm is probably not changing much over time, and probably not in incompatible ways for most locales). Maybe we can just make sure "PRAGMA consistency_check" notices if there is an inconsistency and that VACUUM will fix this. Btw, even if this is implemented, there is still a need for a standardization such new collation names. E.g. that new language neutral collation could be called Unicode or DUCET? And how about language specific collations? After some thoughts, I'd suggest something like UNIL_en_AU (where UNIL means Unicode linguistic - i.e. some characters are properly ignored, given for example by an ordering of 'con', 'coop', 'co-op') and UNIS_en_AU (where UNIS means Unicode strings - i.e. special characters aren't ignored, so that above words would be ordered as 'co-op', 'con', 'coop'). I don't find this particularly important, because the collation name has to be on the tables, so it can be called "DEFAULT" and have only data for the "fr_FR" locale on the "default" tables (including the DUCET base embedded) on embedded devices. I'm thinking there will be a "reference" database with all locale data, and it's up to users to use it "as is" or build their own (maybe just rename locales). It seems more natural to use the standard C locale names to me (the usual "pt_PT" and "pt_BR" for Portuguese/Portugal culture and Portuguese/Brazil culture) , but I'm open to suggestions when that problem arise, and I'm sure there are already standards we can follow in relation to that. Well, I will probably only have time to actually put words into code next weekend, so I will say something when I have source code to show. Best regards, ~Nuno Lucas Jiri - To unsubscribe, send email to [EMAIL PROTECTED] - - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/28/07, Jiri Hajek <[EMAIL PROTECTED]> wrote: Btw, even if this is implemented, there is still a need for a standardization such new collation names. E.g. that new language neutral collation could be called Unicode or DUCET? And how about language specific collations? After some thoughts, I'd suggest something like UNIL_en_AU (where UNIL means Unicode linguistic - i.e. some characters are properly ignored, given for example by an ordering of 'con', 'coop', 'co-op') and UNIS_en_AU (where UNIS means Unicode strings - i.e. special characters aren't ignored, so that above words would be ordered as 'co-op', 'con', 'coop'). At least part of the collation name would come from the special database table, so it may be useful to have some general prefix that indicates sqlite is using the internal algorithm + table style mechanism for it. E.g. "IUC:" for "Included Unicode Collation" or something. That's probably not a valid identifier currently, but you get the idea... - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/28/07, Nuno Lucas <[EMAIL PROTECTED]> wrote: One thing I noticed is that "collations" != "case change". This will not make it possible to use UPPER/LOWER with the same data on the table, but maybe we can work on something in that respect, also. If the intent is case-insensitive comparisons a la LIKE, note that "upper" and "lower" operations alone won't suffice, because in some scripts a single character maps to multiple characters in one direction. Unicode defines a "case folding" operation for such uses, which does full expansion no matter which direction it is (lower->upper or upper->lower), allowing the result to be handled with a simple memcmp()-style binary comparison. It should certainly be possible to cover those operations too, it's just as nuanced as the collation work :) - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
My idea is to implement the UCA collation in SQLite (with the usual OMIT_* #ifdef's), using the DUCET table as base, and if people need the tailoring part for localized sorting, have it be optional by having a "sqlite_collation_data" table with the needed locale data included on the database. That would certainly be great if this is implemented. Note, however, that it doesn't fully solve the issues described in this thread - i.e. if you create a DB by some SQLite version and then use it by a newer version, where some elements were added/modified in DUCET, indexes of these DBs wouldn't be compatible. It can be resolved in several ways, e.g. as suggested to have all DUCET data stored in a special table in SQLite database. It's just a matter of choosing a well-balanced solution... Btw, even if this is implemented, there is still a need for a standardization such new collation names. E.g. that new language neutral collation could be called Unicode or DUCET? And how about language specific collations? After some thoughts, I'd suggest something like UNIL_en_AU (where UNIL means Unicode linguistic - i.e. some characters are properly ignored, given for example by an ordering of 'con', 'coop', 'co-op') and UNIS_en_AU (where UNIS means Unicode strings - i.e. special characters aren't ignored, so that above words would be ordered as 'co-op', 'con', 'coop'). Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/27/07, Jiri Hajek <[EMAIL PROTECTED]> wrote: > Trevor proposed method doesn't need that, because the collation data > goes with the database (which is the big advantage to me). And if you > want to use another collation, you only need to import the collation > data needed for that locale. Well, Trevor's proposal sounds good indeed, but realistically, I don't see any big chances it to be implemented - collations aren't that simple and maintaining all the related issues without relying on some external code is probably too much to expect. Well, MySQL does it since v4.1, and from what I googled (I don't have the links on this PC), the collation data needed with the adjustments (tailoring) for the different locales [1] is not that much (and it doesn't change the algorithm code, only data tables are needed). From what I have seen, the Unicode Collation Algorithm (UCA) [2] is already a good base for general ordering, and it seems the European Ordering Rules (EOR) [3] is similar to UCA with the Default Unicode Collation Element Table - DUCET, which already is good enough for many people (including 90% of my needs). Yesterday it looked like unicode.org was down and I haven't had the chance to look at the UCA report (UTS #10) in depth, but I will try to come up with a proposal for sqlite. My idea is to implement the UCA collation in SQLite (with the usual OMIT_* #ifdef's), using the DUCET table as base, and if people need the tailoring part for localized sorting, have it be optional by having a "sqlite_collation_data" table with the needed locale data included on the database. One thing I noticed is that "collations" != "case change". This will not make it possible to use UPPER/LOWER with the same data on the table, but maybe we can work on something in that respect, also. Regards, ~Nuno Lucas [1] http://developer.mimer.com/collations/charts/index.tml [2] http://www.unicode.org/reports/tr10/ [3] http://en.wikipedia.org/wiki/European_Ordering_Rules Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
> Actually, reading one of the links you posted > (http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx - > everybody please read it before continuing in discussion), I got (what > I think is a great) idea: Let's include version information about > collation algorithm we are using. So, extending my previous > suggestion, the SQLite collation title would look like > 'en_AU_Win_2_1', where the individual parts mean: > 'en_AU' - obviously a sort order > 'Win' - that we are basing the ordering on default Windows methods > (otherwise could be e.g. 'ICU') > '2_1' - version string as retrieved from GetNLSVersion() function and > NLSVERSIONINFO struct (see the link above for details). I don't see how that is different than the previous difficulty of having incompatible databases between systems. The only new thing is we can check it, but the database will have inconsistent behaviour anyway. As I wrote, my main intention was to handle incompatibilities among applications reading a database on a given system - and this would handle the issue perfectly. > This way we are absolutely sure that we don't corrupt SQLite database. > While I'm working on a single system, the version string wouldn't > change. Then, e.g. if I open SQLite database with 'en_AU_Win_2_1' > collation on a system NLSVERSIONINFO returns version 2.3, I know I > have to reindex, change collation title to 'en_AU_Win_2_3' and then > can safely use the database. For small databases, the task of re-indexing is not big, but you are forgetting you can have a database in a shared network folder, used by PC's in different parts of the world and even different OSs (with samba/cifs). That's why I like Trevor's idea so much. You are right in this, my proposal handles some problems, but can't handle this situation. That said, it at least allows sharing databases under compatible platforms - which is still an improvement over the current situation where every application has its own collation and there's absolutely no compatibility. Trevor proposed method doesn't need that, because the collation data goes with the database (which is the big advantage to me). And if you want to use another collation, you only need to import the collation data needed for that locale. Well, Trevor's proposal sounds good indeed, but realistically, I don't see any big chances it to be implemented - collations aren't that simple and maintaining all the related issues without relying on some external code is probably too much to expect. Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/27/07, Jiri Hajek <[EMAIL PROTECTED]> wrote: Actually, reading one of the links you posted (http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx - everybody please read it before continuing in discussion), I got (what I think is a great) idea: Let's include version information about collation algorithm we are using. So, extending my previous suggestion, the SQLite collation title would look like 'en_AU_Win_2_1', where the individual parts mean: 'en_AU' - obviously a sort order 'Win' - that we are basing the ordering on default Windows methods (otherwise could be e.g. 'ICU') '2_1' - version string as retrieved from GetNLSVersion() function and NLSVERSIONINFO struct (see the link above for details). I don't see how that is different than the previous difficulty of having incompatible databases between systems. The only new thing is we can check it, but the database will have inconsistent behaviour anyway. This way we are absolutely sure that we don't corrupt SQLite database. While I'm working on a single system, the version string wouldn't change. Then, e.g. if I open SQLite database with 'en_AU_Win_2_1' collation on a system NLSVERSIONINFO returns version 2.3, I know I have to reindex, change collation title to 'en_AU_Win_2_3' and then can safely use the database. Trevor proposed method doesn't need that, because the collation data goes with the database (which is the big advantage to me). And if you want to use another collation, you only need to import the collation data needed for that locale. It seems to me to be a really cross-platform solution and what's best - without any coding on SQLite side necessary (even though possible to handle some parts of this proposal internally). On the articles you have pointed (but I don't have a link right now) also states that SQL Server and the Jet engine don't use the system collation functions, but instead have their own integrated collation system (based on the Windows one, off course) exactly for the same reason. For small databases, the task of re-indexing is not big, but you are forgetting you can have a database in a shared network folder, used by PC's in different parts of the world and even different OSs (with samba/cifs). That's why I like Trevor's idea so much. Best regards, ~Nuno Lucas Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
After thinking a bit, it occurs to me that there's a compromise for the Unicode case that might be workable. The algorithm for collation is pretty stable, it's just the locale data that's the problem. If SQLite understands the algorithm, then locale data can go into special tables in the database itself. Applications manipulating the database schema would need to have the relevant collation data on hand to fill in the database, but other apps concerned with only the data could operate without any special knowledge. This approach keeps the database internally self-consistent while avoiding platform and versioning issues. Something like this could probably work, but it still seems to me that we are assigning SQLite tasks that OS or some other libraries should handle. Actually, reading one of the links you posted (http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx - everybody please read it before continuing in discussion), I got (what I think is a great) idea: Let's include version information about collation algorithm we are using. So, extending my previous suggestion, the SQLite collation title would look like 'en_AU_Win_2_1', where the individual parts mean: 'en_AU' - obviously a sort order 'Win' - that we are basing the ordering on default Windows methods (otherwise could be e.g. 'ICU') '2_1' - version string as retrieved from GetNLSVersion() function and NLSVERSIONINFO struct (see the link above for details). This way we are absolutely sure that we don't corrupt SQLite database. While I'm working on a single system, the version string wouldn't change. Then, e.g. if I open SQLite database with 'en_AU_Win_2_1' collation on a system NLSVERSIONINFO returns version 2.3, I know I have to reindex, change collation title to 'en_AU_Win_2_3' and then can safely use the database. It seems to me to be a really cross-platform solution and what's best - without any coding on SQLite side necessary (even though possible to handle some parts of this proposal internally). Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/27/07, Trevor Talbot <[EMAIL PROTECTED]> wrote: On 6/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > > So the choices seem to be: > > > > > > (1) Databases that corrupt if you move across platforms. > > > (2) A 10MB database engine > > > (3) Leave things as they are > OK. Here is a crazy idea for consideration: > > You know that you can create a custom collating function as a > DLL or shared library and load it at runtime, right? This has > been a capability since version 3.3.7. Suppose we define a > special table in the database file that is designed to hold > DLLs and/or shared libraries. After thinking a bit, it occurs to me that there's a compromise for the Unicode case that might be workable. The algorithm for collation is pretty stable, it's just the locale data that's the problem. If SQLite understands the algorithm, then locale data can go into special tables in the database itself. Applications manipulating the database schema would need to have the relevant collation data on hand to fill in the database, but other apps concerned with only the data could operate without any special knowledge. This approach keeps the database internally self-consistent while avoiding platform and versioning issues. It's a thought. I love your idea and seems the most promising by far. If we could have it in a way such that people who only use a specific set of locales (like applications that are only internationalized for the german, spanish and italian market, for example) could only have the locale data they need (like ignore hebrew or chinese), I believe that would be the best of all worlds. Maybe we could have a database (the one with the 10MB in size) in the contrib area with all the locale data and people would then import the locales they need into their own databases on creation. Indexes would continue to be coherent across platforms and different locales (with the proper care). This is good even for embedded devices, as they are usually deployed on a single locale, and can have only the locale data they need. Best regards, ~Nuno Lucas - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > So the choices seem to be: > > > > (1) Databases that corrupt if you move across platforms. > > (2) A 10MB database engine > > (3) Leave things as they are OK. Here is a crazy idea for consideration: You know that you can create a custom collating function as a DLL or shared library and load it at runtime, right? This has been a capability since version 3.3.7. Suppose we define a special table in the database file that is designed to hold DLLs and/or shared libraries. After thinking a bit, it occurs to me that there's a compromise for the Unicode case that might be workable. The algorithm for collation is pretty stable, it's just the locale data that's the problem. If SQLite understands the algorithm, then locale data can go into special tables in the database itself. Applications manipulating the database schema would need to have the relevant collation data on hand to fill in the database, but other apps concerned with only the data could operate without any special knowledge. This approach keeps the database internally self-consistent while avoiding platform and versioning issues. It's a thought. On 6/27/07, Joe Wilson <[EMAIL PROTECTED]> wrote: The solution is obvious: http://en.wikipedia.org/wiki/Esperanto :D - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On 6/27/07, Jiri Hajek <[EMAIL PROTECTED]> wrote: I tried to think about this in as many details as possible and I'd say that the original problem that I'd like to solve here (i.e. collation of Unicode character sets) doesn't require any special complex handling in SQLite. We really should just define how will individual collations be named, so that Unicode SQLite databases become portable. The only argument against this was that individual implementations of Unicode standard (i.e. mainly internal Windows methods or ICU library) could differ. However, is it really a problem? I'd say that it isn't. Unicode specifies exactly how characters should be sorted and so _if_ there really are any differences between Windows and ICU implementations, they should be considered as a bug and be solved there. I.e. it isn't anything SQLite should be responsible for. No, it doesn't. See _The Unicode Standard 5.0_ chapter 5 section 16 (http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf). Unicode Technical Standard #10 (http://www.unicode.org/reports/tr10/) specifies how collections work, and provides a default table, but note carefully that the table is versioned and contains incompatible changes between versions -- see the Modifications section at the end. It does not deal with locale-specific collation at all. I'll also refer you to the blog of Michael Kaplan, someone involved in both the Unicode Consortium and the NLS group at Microsoft: http://blogs.msdn.com/michkap/search.aspx?q=collation On platform compatibility: http://blogs.msdn.com/michkap/archive/2004/11/28/271121.aspx http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx What Microsoft does with their own SQL Server: http://blogs.msdn.com/michkap/archive/2005/03/11/394359.aspx The same issues apply to anything you try to do with SQLite. Unicode is nowhere near as simple as people would like it to be, which is why everyone picks a specific platform with a known state and sticks to it like glue. There be dragons here. - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
A moments reflection convinces me that this is not a good idea as stated. But perhaps it can be the seed for a idea that will actually work. Is there some way of adding customizations to the database file itself, or perhaps to a separate file in a standard place the SQLite always knows to look, so that custom enhancements and extensions to SQLite can be accomodated in standard tools? Worth thinking about, perhaps I tried to think about this in as many details as possible and I'd say that the original problem that I'd like to solve here (i.e. collation of Unicode character sets) doesn't require any special complex handling in SQLite. We really should just define how will individual collations be named, so that Unicode SQLite databases become portable. The only argument against this was that individual implementations of Unicode standard (i.e. mainly internal Windows methods or ICU library) could differ. However, is it really a problem? I'd say that it isn't. Unicode specifies exactly how characters should be sorted and so _if_ there really are any differences between Windows and ICU implementations, they should be considered as a bug and be solved there. I.e. it isn't anything SQLite should be responsible for. Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
--- [EMAIL PROTECTED] wrote: > OK. Here is a crazy idea for consideration: > > You know that you can create a custom collating function as a > DLL or shared library and load it at runtime, right? This has > been a capability since version 3.3.7. Suppose we define a > special table in the database file that is designed to hold > DLLs and/or shared libraries. Perhaps something like this: > > CREATE TABLE sqlite_shared_libs( >architecture TEXT, >library BLOB > ); > > You can put DLLs or shared libraries for any custom collating > functions (also any custom SQL functions or virtual table > implementations) into this table and SQLite will load them > automatically which is starts up. This allows you to create > any non-standard database extensions you want and make them > completely portable to any database editor or other tool that > wants to access the database. Christian Werner's SQLiteODBC driver does something similar to your proposal except it compiles C function extensions on the fly using TCC: http://www.mail-archive.com/sqlite-users@sqlite.org/msg21945.html http://sqlite.org/contrib/download/sqlite%2Btcc.tar.gz?get=13 The idea could be generalized to run any arbitrary binary and/or script to make a shared library. See also the sqlite_extensions proposal at the bottom of this page: http://www.sqlite.org/cvstrac/wiki?p=LoadableExtensions It takes into account extension library load/unload order as well as the name the names of the function entry points. Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center. http://autos.yahoo.com/green_center/ - To unsubscribe, send email to [EMAIL PROTECTED] -
RE: [sqlite] Unicode collation
> -Original Message- > From: Jiri Hajek [mailto:[EMAIL PROTECTED] > Sent: Monday, June 25, 2007 8:14 AM > To: sqlite-users@sqlite.org > Subject: Re: [sqlite] Unicode collation > > > So the choices seem to be: > > > > (1) Databases that corrupt if you move across platforms. > > (2) A 10MB database engine > > (3) Leave things as they are > > > > I prefer to go with (3) > > Yes, based on the facts I agree that (3) is the best way to go - a > developer can decide whether to use Windows methods only, or be > cross-platform and use ICU or anything else. > > However, it isn't what I mean. I suggest to publish some kind of a > standard that database developers would use. I.e. SQLite web would > define, that if you want to make a Unicode database that can be > openned by other applications, you should define collation e.g. as > 'en_AU' for Australian English. This way, if I want let users of my > application to open DB in e.g. in some database editor, I can use this > standard and users will be able to open it in many applications > following the standard. > > So, what I propose aren't actually any changes in SQLite, but rather > publishing something that would guide all SQLite developers. I thought that was what was being responded to; publishing such a recommendation leads to problem (1), because the collation rules will not match in different environments. -- James - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
"Jiri Hajek" <[EMAIL PROTECTED]> wrote: > > So the choices seem to be: > > > > (1) Databases that corrupt if you move across platforms. > > (2) A 10MB database engine > > (3) Leave things as they are > > > > I prefer to go with (3) > > Yes, based on the facts I agree that (3) is the best way to go - a > developer can decide whether to use Windows methods only, or be > cross-platform and use ICU or anything else. > > However, it isn't what I mean. I suggest to publish some kind of a > standard that database developers would use. I.e. SQLite web would > define, that if you want to make a Unicode database that can be > openned by other applications, you should define collation e.g. as > 'en_AU' for Australian English. This way, if I want let users of my > application to open DB in e.g. in some database editor, I can use this > standard and users will be able to open it in many applications > following the standard. > > So, what I propose aren't actually any changes in SQLite, but rather > publishing something that would guide all SQLite developers. > OK. Here is a crazy idea for consideration: You know that you can create a custom collating function as a DLL or shared library and load it at runtime, right? This has been a capability since version 3.3.7. Suppose we define a special table in the database file that is designed to hold DLLs and/or shared libraries. Perhaps something like this: CREATE TABLE sqlite_shared_libs( architecture TEXT, library BLOB ); You can put DLLs or shared libraries for any custom collating functions (also any custom SQL functions or virtual table implementations) into this table and SQLite will load them automatically which is starts up. This allows you to create any non-standard database extensions you want and make them completely portable to any database editor or other tool that wants to access the database. A moments reflection convinces me that this is not a good idea as stated. But perhaps it can be the seed for a idea that will actually work. Is there some way of adding customizations to the database file itself, or perhaps to a separate file in a standard place the SQLite always knows to look, so that custom enhancements and extensions to SQLite can be accomodated in standard tools? Worth thinking about, perhaps -- D. Richard Hipp <[EMAIL PROTECTED]> - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
So the choices seem to be: (1) Databases that corrupt if you move across platforms. (2) A 10MB database engine (3) Leave things as they are I prefer to go with (3) Yes, based on the facts I agree that (3) is the best way to go - a developer can decide whether to use Windows methods only, or be cross-platform and use ICU or anything else. However, it isn't what I mean. I suggest to publish some kind of a standard that database developers would use. I.e. SQLite web would define, that if you want to make a Unicode database that can be openned by other applications, you should define collation e.g. as 'en_AU' for Australian English. This way, if I want let users of my application to open DB in e.g. in some database editor, I can use this standard and users will be able to open it in many applications following the standard. So, what I propose aren't actually any changes in SQLite, but rather publishing something that would guide all SQLite developers. Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
"Jiri Hajek" <[EMAIL PROTECTED]> wrote: > Ok, drh, could you share your thoughts about this. Could something > like this, i.e. some kind of recommendation be created, so that all > applications are consistent in Unicode support? > Dan's observation is correct. It seems to me that the collating sequences generated by version editions of Windows are unlikely to be identical. Almost certainly the collating sequence for Windows will be different from ICU. So if we were to make any of the collating sequences standard, then suddenly SQLite databases are no longer cross-platform. We could build in our own collating functions. But the collation tables are huge. The 250KB SQLite library would mushroom into the 10MB range. So the choices seem to be: (1) Databases that corrupt if you move across platforms. (2) A 10MB database engine (3) Leave things as they are I prefer to go with (3) -- D. Richard Hipp <[EMAIL PROTECTED]> - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
Ok, drh, could you share your thoughts about this. Could something like this, i.e. some kind of recommendation be created, so that all applications are consistent in Unicode support? Thanks, Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On Sat, 2007-06-23 at 10:56 +0200, Jiri Hajek wrote: > > The reason is as you've surmised. Not all systems have full unicode > > support (I'm not sure, but if I had to guess, I would say very few > > systems do). Including an implementation with SQLite would bloat > > the library to at least several times it's current size. > > I know really well only Windows where it's only about using > CompareString() function, i.e. almost no code in SQLite. > > > Another reason is that it's a pretty complex topic. Supporting > > most European languages would probably be possible without too > > much trouble, but once you get into Asian and Middle-eastern > > languages I think it's much harder. > > As far as I know, there wouldn't be any complexity on SQLite's side - > it's only about calling proper methods (be it CompareString() on > Windows or ICU methods elsewhere), i.e. again pretty much no code > needed in SQLite. > > > There is an extension packaged with SQLite sources that uses the > > ICU library to provide locale dependent collation sequences and > > case folding. See here for details: > > ICU is nice, but pretty large. Since I develop for Windows, I'd rather > not distribute it with my application considering that this is alredy > provided in Windows in reasonable quality. > > Anyway, I guess that the question isn't mainly about how to implement > this in SQLite, but about the problem that SQLite doesn't define any > standard how to handle Unicode. Currently, any application that needs > to work with Unicode data has to define its own collation and name it > 'tr_TR', 'turkish', 'MyTurkish', or any other way. The result is a big > mess and no chance of opening SQLite database in other application > than it was designed for. > > So, why don't we (or you - SQLite developers) define how to name > collations (e.g. that 'tr_TR', 'en_AU', ... standard?) and then every > database complying this would be perfectly portable. That's not too bad an idea. One thing to watch out for is that if the definition of a collation sequence that you have used to create an index varies even slightly from machine to machine (say from win98 to windows vista), you are headed for database corruption. For that reason I'd be a bit reluctant to encourage people to use more than one implementation of a named collation sequence. Dan. - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
The reason is as you've surmised. Not all systems have full unicode support (I'm not sure, but if I had to guess, I would say very few systems do). Including an implementation with SQLite would bloat the library to at least several times it's current size. I know really well only Windows where it's only about using CompareString() function, i.e. almost no code in SQLite. Another reason is that it's a pretty complex topic. Supporting most European languages would probably be possible without too much trouble, but once you get into Asian and Middle-eastern languages I think it's much harder. As far as I know, there wouldn't be any complexity on SQLite's side - it's only about calling proper methods (be it CompareString() on Windows or ICU methods elsewhere), i.e. again pretty much no code needed in SQLite. There is an extension packaged with SQLite sources that uses the ICU library to provide locale dependent collation sequences and case folding. See here for details: ICU is nice, but pretty large. Since I develop for Windows, I'd rather not distribute it with my application considering that this is alredy provided in Windows in reasonable quality. Anyway, I guess that the question isn't mainly about how to implement this in SQLite, but about the problem that SQLite doesn't define any standard how to handle Unicode. Currently, any application that needs to work with Unicode data has to define its own collation and name it 'tr_TR', 'turkish', 'MyTurkish', or any other way. The result is a big mess and no chance of opening SQLite database in other application than it was designed for. So, why don't we (or you - SQLite developers) define how to name collations (e.g. that 'tr_TR', 'en_AU', ... standard?) and then every database complying this would be perfectly portable. Jiri - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] Unicode collation
On Fri, 2007-06-22 at 18:57 +0200, Jiri Hajek wrote: > Hello, > > I wonder whether there are any plans to include internally proper > Unicode comparisons? Don't get me wrong, I think that it's great that > SQLite supports custom collations, there's absolutely no problem to > handle it in internally for my database, but problem is that if I > define UNICODE collation, no other application knows about it and so > users can't open it in any SQLite DB editor. > > Nowadays applications without Unicode support slowly become rare, as I > see, I'm not the first one asking for this kind of support in SQLite. > Is there any technical reason why not to include UNICODE and e.g. > IUNICODE (for case-insensitive comparisons) collations in SQLite? Is > it because of some systems that don't have (full) Unicode support? In > such a case, I guess that it could be a compile-time option. The reason is as you've surmised. Not all systems have full unicode support (I'm not sure, but if I had to guess, I would say very few systems do). Including an implementation with SQLite would bloat the library to at least several times it's current size. Another reason is that it's a pretty complex topic. Supporting most European languages would probably be possible without too much trouble, but once you get into Asian and Middle-eastern languages I think it's much harder. There is an extension packaged with SQLite sources that uses the ICU library to provide locale dependent collation sequences and case folding. See here for details: http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt Dan. - To unsubscribe, send email to [EMAIL PROTECTED] -