Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution
I posted about this subject earlier too (in OpenSolaris we added Unicode tolower() and toupper() functions to SQLite 2.x). But the fact that SQLite 3.x already supports ICU dissuaded me from pursuing the matter further.

OpenSolaris has a light-weight Unicode API (licensed under the CDDL), much like the one you wrote, capable of doing normalization and case conversions, as well as case- and even normalization-insensitive string comparison.

It hadn't occurred to me that ICU might be so large as to make use of alternative libraries interesting. When we port the relevant app in OpenSolaris to use SQLite 3.x I'll look again at contributing patches that add support for the OpenSolaris Unicode APIs.

I hadn't considered something like an "unaccented collation" -- it sounds tricky. Which modifiers should be dropped, and which should be kept? That can depend on what language the text is written in, and on how much lossiness or what false-positive rate you're willing to accept. I recommend that you try normalization-insensitive collation before resorting to an "unaccented collation."

Nico
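For readers who want to see what a case- and normalization-insensitive collation might look like in practice, here is a minimal sketch using Python's standard sqlite3 and unicodedata modules. It is not the OpenSolaris API mentioned above; the collation name NOCASE_NORM and the helper names are made up for illustration.

```python
import sqlite3
import unicodedata


def norm_key(s: str) -> str:
    # Normalize, case-fold, then normalize again, because case folding
    # can produce text that is no longer in normalized form.
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize("NFD", folded)


def nocase_norm(a: str, b: str) -> int:
    # SQLite collations return a negative, zero, or positive integer.
    ka, kb = norm_key(a), norm_key(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
# NOCASE_NORM is a hypothetical name chosen for this sketch.
conn.create_collation("NOCASE_NORM", nocase_norm)
conn.execute("CREATE TABLE words(w TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [
        ("caf\u00e9",),    # precomposed e-acute
        ("CAFE\u0301",),   # upper case, e followed by a combining acute
        ("cafe",),         # no accent at all: must NOT match
    ],
)
match_count = conn.execute(
    "SELECT count(*) FROM words WHERE w = ? COLLATE NOCASE_NORM",
    ("Caf\u00e9",),
).fetchone()[0]
print(match_count)
```

The two accented spellings compare equal while the unaccented "cafe" stays distinct, which is the difference between this approach and an "unaccented collation."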
Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution
On Jan 4, 2008 8:43 AM, Nicolas Williams <[EMAIL PROTECTED]> wrote:
> It hadn't occurred to me that ICU might be so large as to make use of
> alternative libraries interesting. When we port the relevant app in
> OpenSolaris to use SQLite 3.x I'll look again at contributing patches
> that add support for the OpenSolaris Unicode APIs.

ICU doesn't need to be all that large. By default, it builds everything and encodes the data it needs into an object file, so it is pretty large. But you can cut that down to just the items you actually need. If all you need is case-folding, then the footprint is going to be a lot smaller.

Of course, if you're using the ICU libraries provided with your system, you'll be getting it essentially for free.

-scott
Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution
Nice! I can imagine that this could also be used in, e.g., FTS3; the unaccent() function in particular could make searching better for international users.

Thanks for sharing your code,
Jiri
Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution
On Jan 3, 2008 4:10 AM, ioannis <[EMAIL PROTECTED]> wrote:
> Dear all SQLite3 users,
>
> Recently I have been working on a dictionary-style project that had to
> work with UNICODE non-latin1 strings. I did try the ICU project but I
> wasn't satisfied with the extra baggage that came with it.
> I would like to recommend the following possible solution to the
> long-standing UNICODE issue, which was built as an ICU alternative
> (excluding collations) and could easily be included in the SQLite
> core as default behavior.
>
> http://ioannis.mpsounds.net/blog/?dl=sqlite3_unicode.c
>
> The above file contains mapping tables for lower(), upper(), title(),
> fold()* characters based on UNICODE mapping tables as described
> currently by the UNICODE standard v5.1.0 beta, that are used by
> functions to transform characters to their respective folding cases.
> (These tables were built by a modified version of Loic Dachary's builder
> in order to include the required case transformations.)
> * UNICODE uses case folding mapping tables to implement
> case-insensitive comparison sequences (e.g. LIKE).
>
> The above file utilizes the existing ICU infrastructure built into
> SQLite in order to activate the extra functionality, to automatically:
> - override the LIKE operation, to support full UNICODE
> case-insensitive comparison
> - override upper(), lower(), to support case transformation of UNICODE
> characters based on UNICODE mapping tables as described currently by
> the UNICODE standard v5.1.0 beta
> - provide title() and fold() functions, also based on UNICODE mapping
> tables as described currently by the UNICODE standard v5.1.0 beta
> - provide an unaccent() function (based on the unac library designed for
> Linux by Loic Dachary) to decompose UNICODE characters to their
> unaccented equivalents in order to perform simpler queries and return
> a wider range of results (e.g. ά -> α, æ -> ae; in the latter example the
> string will automatically grow by 1 character)
>
> In comparison to ICU, no collation sequences have been implemented yet.
> The above functionalities have been designed to be included/excluded
> independently according to specific needs in order to minimize the
> size of the library.
> The total overhead over the SQLite library size with all functionality
> enabled is approximately 70~80KB.
>
> The above file has not been thoroughly tested, but I consider the
> implementation to be stable.
> You can leave comments, bug reports, suggestions on this board or at
> http://ioannis.mpsounds.net/blog/2007/12/19/sqlite-native-unicode-like-support
> (PS. I am not an SQLite expert, but I had to improvise to some extent
> on this matter.)
>
> Thank you very much.
>

I guess I'm confused about what the purpose of this is. I'm far from a Unicode expert, but my understanding thus far is that there is no one solution. Locales are there for a reason: different places can use different sort orders and case conversions. Your blog makes using locales seem like a detriment, but I'm not sure how you can get around them.

--
Cory Nelson
http://www.int64.org
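To make the locale point concrete, here is a small sketch in Python of a case where a single locale-independent mapping table cannot be right for everyone: the Turkish dotted and dotless i. The TURKISH_LOWER table and both function names are hypothetical and exist only for this illustration; a real implementation would need the full language-specific rules.

```python
# Hypothetical per-language override table; only the Turkish/Azeri
# dotted and dotless i are covered here.
TURKISH_LOWER = {
    "I": "\u0131",   # LATIN CAPITAL LETTER I -> dotless i
    "\u0130": "i",   # capital I with dot above -> plain i
}


def lower_default(s: str) -> str:
    # Locale-independent lowercasing, as a single mapping table would do it.
    return s.lower()


def lower_turkish(s: str) -> str:
    # Apply the Turkish overrides first, then fall back to the default mapping.
    return "".join(TURKISH_LOWER.get(ch, ch.lower()) for ch in s)


word = "I\u015eIK"  # Turkish for "light", written in capitals
default_result = lower_default(word)
turkish_result = lower_turkish(word)
print(default_result, turkish_result)
```

The default mapping produces a dotted i in both positions, which is correct for English but not for Turkish, so which result is "right" depends on information the string itself does not carry.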
[sqlite] Possible UNICODE LIKE, upper(), lower() function solution
Dear all SQLite3 users,

Recently I have been working on a dictionary-style project that had to work with UNICODE non-latin1 strings. I did try the ICU project, but I wasn't satisfied with the extra baggage that came with it. I would like to recommend the following possible solution to the long-standing UNICODE issue, which was built as an ICU alternative (excluding collations) and could easily be included in the SQLite core as default behavior.

http://ioannis.mpsounds.net/blog/?dl=sqlite3_unicode.c

The above file contains mapping tables for lower(), upper(), title(), fold()* characters based on UNICODE mapping tables as described currently by the UNICODE standard v5.1.0 beta, that are used by functions to transform characters to their respective folding cases. (These tables were built by a modified version of Loic Dachary's builder in order to include the required case transformations.)

* UNICODE uses case folding mapping tables to implement case-insensitive comparison sequences (e.g. LIKE).

The above file utilizes the existing ICU infrastructure built into SQLite in order to activate the extra functionality, to automatically:
- override the LIKE operation, to support full UNICODE case-insensitive comparison
- override upper(), lower(), to support case transformation of UNICODE characters based on UNICODE mapping tables as described currently by the UNICODE standard v5.1.0 beta
- provide title() and fold() functions, also based on UNICODE mapping tables as described currently by the UNICODE standard v5.1.0 beta
- provide an unaccent() function (based on the unac library designed for Linux by Loic Dachary) to decompose UNICODE characters to their unaccented equivalents in order to perform simpler queries and return a wider range of results (e.g. ά -> α, æ -> ae; in the latter example the string will automatically grow by 1 character)

In comparison to ICU, no collation sequences have been implemented yet.

The above functionalities have been designed to be included/excluded independently according to specific needs in order to minimize the size of the library. The total overhead over the SQLite library size with all functionality enabled is approximately 70~80KB.

The above file has not been thoroughly tested, but I consider the implementation to be stable. You can leave comments, bug reports, suggestions on this board or at http://ioannis.mpsounds.net/blog/2007/12/19/sqlite-native-unicode-like-support

(PS. I am not an SQLite expert, but I had to improvise to some extent on this matter.)

Thank you very much.
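To illustrate the kind of overrides described above, here is a rough sketch using Python's standard sqlite3 module rather than the C file itself. It relies on Python's built-in Unicode tables instead of the generated mapping tables, and the helper names, the EXTRA_UNACCENT table, and the simplified LIKE matcher (no ESCAPE clause, text arguments only) are assumptions made for this example. It does rely on one documented SQLite behavior: `X LIKE Y` is evaluated by calling the two-argument function like(Y, X), so registering a function under that name changes the operator.

```python
import re
import sqlite3
import unicodedata

# Hypothetical, deliberately tiny table for characters that have no
# canonical decomposition (the real unac tables are far larger).
EXTRA_UNACCENT = {"\u00e6": "ae", "\u00c6": "AE"}  # ae ligature, both cases


def unaccent(s):
    if s is None:
        return None
    out = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            continue  # drop combining marks left behind by decomposition
        out.append(EXTRA_UNACCENT.get(ch, ch))
    return "".join(out)


def like_to_regex(pattern):
    # Translate SQL wildcards: % matches any run, _ matches one character.
    parts = []
    for ch in pattern:
        parts.append(".*" if ch == "%" else "." if ch == "_" else re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def unicode_like(pattern, value):
    # SQLite passes the pattern first: `value LIKE pattern` -> like(pattern, value).
    if pattern is None or value is None:
        return None
    matched = like_to_regex(pattern.casefold()).fullmatch(value.casefold())
    return 1 if matched else 0


conn = sqlite3.connect(":memory:")
conn.create_function("lower", 1, lambda s: s.lower() if isinstance(s, str) else s)
conn.create_function("upper", 1, lambda s: s.upper() if isinstance(s, str) else s)
conn.create_function("fold", 1, lambda s: s.casefold() if isinstance(s, str) else s)
conn.create_function("unaccent", 1, unaccent)
conn.create_function("like", 2, unicode_like)

upper_alpha = "\u0386\u039b\u03a6\u0391"   # Greek "ALFA" in capitals, with tonos
row = conn.execute(
    "SELECT lower(?), ? LIKE ?, unaccent(?), unaccent(?)",
    (upper_alpha, upper_alpha, "\u03ac\u03bb%", "\u03ac", "\u00e6"),
).fetchone()
print(row)
```

Note that the æ -> ae case needs an explicit table entry because that character has no decomposition in the Unicode data, which is presumably why a dedicated table such as the unac one is used for unaccent().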