Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution

2008-01-04 Thread Nicolas Williams
I posted about this subject earlier too (in OpenSolaris we added Unicode
tolower() and toupper() functions to SQLite 2.x).  But the fact that
SQLite 3.x already supports ICU dissuaded me from pursuing the matter
further.  OpenSolaris has a light-weight Unicode API (licensed under the
CDDL), much like the one you wrote, capable of doing normalization and
case conversions, as well as case- and even normalization-insensitive
string comparison.

It hadn't occurred to me that ICU might be so large as to make use of
alternative libraries interesting.  When we port the relevant app in
OpenSolaris to use SQLite 3.x I'll look again at contributing patches
that add support for the OpenSolaris Unicode APIs.

I hadn't considered something like an "unaccented collation" -- it
sounds tricky.  Which modifiers should be dropped, which should be kept?
That can depend on what language the text is written in, and how much
lossiness or what false positive rates you're willing to accept.  I
recommend that you try normalization-insensitive collation before
resorting to an "unaccented collation."
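
As a rough illustration of what I mean by normalization-insensitive
collation, here is a minimal sketch in Python using its bundled sqlite3
module. The collation name NORM_NOCASE and the helper names are made up
for this example, and "NFC then case-fold" is a simplification of full
Unicode caseless matching, not what the OpenSolaris API does.

```python
import sqlite3
import unicodedata


def _key(text: str) -> str:
    # Normalize to NFC so precomposed and decomposed spellings of the
    # same character compare equal, then case-fold to ignore case.
    return unicodedata.normalize("NFC", text).casefold()


def norm_nocase(a: str, b: str) -> int:
    """Collation callback: negative, zero, or positive, like strcmp()."""
    ka, kb = _key(a), _key(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
# NORM_NOCASE is a hypothetical name chosen for this sketch.
conn.create_collation("NORM_NOCASE", norm_nocase)
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    # precomposed e-acute, decomposed E + combining acute, plain e
    [("caf\u00e9",), ("CAFE\u0301",), ("cafe",)],
)
rows = conn.execute(
    "SELECT w FROM words WHERE w = ? COLLATE NORM_NOCASE", ("Caf\u00e9",)
).fetchall()
```

Both accented spellings match the query while plain "cafe" does not,
so nothing is lost the way it would be with an unaccented collation.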

Nico
-- 

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution

2008-01-04 Thread Scott Hess
On Jan 4, 2008 8:43 AM, Nicolas Williams <[EMAIL PROTECTED]> wrote:
> It hadn't occurred to me that ICU might be so large as to make use of
> alternative libraries interesting.  When we port the relevant app in
> OpenSolaris to use SQLite 3.x I'll look again at contributing patches
> that add support for the OpenSolaris Unicode APIs.

ICU doesn't need to be all that large.  By default, it builds
everything, and encodes the data it needs into an object file, so it
is pretty large.  But you can cut that down to just the items you
actually need.  If all you need is case-folding, then the footprint is
going to be a lot smaller.

Of course, if you're using the ICU libraries provided with your
system, you'll be getting it essentially for free.

-scott




Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution

2008-01-04 Thread Jiri Hajek
Nice! I can imagine this also being used in, e.g., FTS3; the unaccent()
function in particular could make searching better for international
users.

Thanks for sharing your code,
Jiri




Re: [sqlite] Possible UNICODE LIKE, upper(), lower() function solution

2008-01-03 Thread Cory Nelson
On Jan 3, 2008 4:10 AM, ioannis <[EMAIL PROTECTED]> wrote:
> Dear all SQLite3 users,
>
> Recently I have been working on a dictionary-style project that had to
> work with UNICODE non-latin1 strings. I did try the ICU project, but I
> wasn't satisfied with the extra baggage that came with it.
> I would like to recommend the following possible solution to the
> long-standing UNICODE issue. It was built as an ICU alternative
> (excluding collations) and could easily be included in the SQLite
> core as default behavior.

I guess I'm confused about what the purpose of this is.  I'm far from a
Unicode expert, but my understanding thus far is that there is no one
solution.

Locales are there for a reason - different places can use different
sort orders and case conversions.  Your blog makes using locales seem
like a detriment, but I'm not sure how you can get around it.
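
To make the locale point concrete, here is a small Python sketch.
Python's str.upper() stands in for a locale-independent mapping table;
upper_turkish() is a hypothetical helper written for this example, not
part of any library.

```python
def upper_default(text: str) -> str:
    # Locale-independent Unicode mapping: "i" always becomes "I".
    return text.upper()


def upper_turkish(text: str) -> str:
    # Hypothetical Turkish/Azeri tailoring: dotted "i" maps to capital
    # I with dot above (U+0130), dotless "ı" (U+0131) maps to "I".
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


default_result = upper_default("istanbul")
turkish_result = upper_turkish("istanbul")
```

The two results differ in their first letter, so a single table cannot
give the answer every language expects.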

-- 
Cory Nelson
http://www.int64.org


[sqlite] Possible UNICODE LIKE, upper(), lower() function solution

2008-01-03 Thread ioannis
Dear all SQLite3 users,

Recently I have been working on a dictionary-style project that had to
work with UNICODE non-latin1 strings. I did try the ICU project, but I
wasn't satisfied with the extra baggage that came with it.
I would like to recommend the following possible solution to the
long-standing UNICODE issue. It was built as an ICU alternative
(excluding collations) and could easily be included in the SQLite
core as default behavior.

http://ioannis.mpsounds.net/blog/?dl=sqlite3_unicode.c

The above file contains mapping tables for the lower(), upper(), title(),
and fold()* functions, based on the UNICODE mapping tables described
in the UNICODE standard v5.1.0 beta. The functions use these tables
to transform characters to their respective case mappings.
(The tables were built with a modified version of Loic Dachary's builder
so as to include the required case transformations.)
* UNICODE uses case-folding mapping tables to implement
case-insensitive comparisons (e.g. LIKE).
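
As a minimal sketch of why folding, rather than lowercasing, is used for
case-insensitive comparison, here is a Python illustration. It relies on
Python's built-in str.casefold() instead of the tables in
sqlite3_unicode.c, and caseless_equal() is a name invented for this
example.

```python
def caseless_equal(a: str, b: str) -> bool:
    # Compare the case-folded forms; folding maps characters such as
    # "ß" to "ss" and final sigma "ς" to "σ", which lower() leaves alone.
    return a.casefold() == b.casefold()


folded_match = caseless_equal("Stra\u00dfe", "STRASSE")
lowered_match = "Stra\u00dfe".lower() == "STRASSE".lower()
```

Here the folded comparison treats the two spellings as equal, while the
lowercase comparison does not.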

The above file uses the existing ICU infrastructure built into
SQLite to activate the extra functionality, and automatically:
- overrides the LIKE operation to support full UNICODE
case-insensitive comparison
- overrides upper() and lower() to support case transformation of UNICODE
characters, based on the same UNICODE v5.1.0 beta mapping tables
- provides title() and fold() functions, also based on those mapping
tables
- provides an unaccent() function (based on the unac library designed for
Linux by Loic Dachary) to decompose UNICODE characters to their
unaccented equivalents, so that simpler queries return a wider range
of results (e.g. ά -> α, æ -> ae; in the latter example the
string automatically grows by one character)
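
The overrides listed above can be approximated from Python's bundled
sqlite3 module, as in the rough sketch below. This is not the code in
sqlite3_unicode.c: it leans on Python's own Unicode data, the EXTRA
table is a tiny hypothetical stand-in for the unac tables, and LIKE with
an ESCAPE clause is not handled.

```python
import re
import sqlite3
import unicodedata

# Hypothetical extra mappings for letters with no canonical decomposition.
EXTRA = {"\u00e6": "ae", "\u00c6": "AE"}


def unaccent(text):
    if text is None:
        return None
    # Decompose, drop combining marks, then apply the extra mappings.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(EXTRA.get(c, c) for c in stripped)


def like(pattern, value):
    # SQLite evaluates "X LIKE Y" by calling like(Y, X): pattern first.
    if pattern is None or value is None:
        return None
    parts = []
    for ch in pattern.casefold():
        if ch == "%":
            parts.append(".*")  # any run of characters
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
    matched = re.fullmatch("".join(parts), value.casefold(), re.DOTALL)
    return int(matched is not None)


def upper(text):
    return text.upper() if isinstance(text, str) else text


def lower(text):
    return text.lower() if isinstance(text, str) else text


conn = sqlite3.connect(":memory:")
conn.create_function("like", 2, like)    # replaces the built-in LIKE
conn.create_function("upper", 1, upper)  # replaces the ASCII-only upper()
conn.create_function("lower", 1, lower)  # replaces the ASCII-only lower()
conn.create_function("unaccent", 1, unaccent)

like_hit = conn.execute("SELECT ? LIKE ?", ("ΆΛΦΑ", "άλφ%")).fetchone()[0]
plain = conn.execute("SELECT unaccent(?)", ("άæ",)).fetchone()[0]
```

Because folding can change a string's length (for example ß becomes
ss), the "_" wildcard in this sketch counts folded characters, which may
not be what every application wants.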

Unlike ICU, no collation sequences have been implemented yet.
Each of the features above can be included or excluded
independently, according to specific needs, to minimize the
size of the library.
With all functionality enabled, the total overhead over the SQLite
library size is approximately 70-80 KB.

The above file has not been thoroughly tested, but I consider the
implementation to be stable.
You can leave comments, bug reports, and suggestions on this board or at
http://ioannis.mpsounds.net/blog/2007/12/19/sqlite-native-unicode-like-support
(PS. I am not an SQLite expert, so I had to improvise to some extent
on this matter.)

Thank you very much.