Re: [sqlite] Unicode collation

2007-06-28 Thread Trevor Talbot

On 6/28/07, Nuno Lucas <[EMAIL PROTECTED]> wrote:


It seems more natural to use the standard C locale names to me (the
usual "pt_PT" and "pt_BR" for  Portuguese/Portugal culture and
Portuguese/Brazil culture) , but I'm open to suggestions when that
problem arise, and I'm sure there are already standards we can follow
in relation to that.


RFCs 4646 and 4647 cover this convention.  They get rather verbose
about it, but it boils down to using ISO-defined language and country
codes in a particular format.  Most internet protocols (http etc) use
the same convention, so we'd be in good company.

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-28 Thread Nuno Lucas

On 6/28/07, Trevor Talbot <[EMAIL PROTECTED]> wrote:

On 6/28/07, Nuno Lucas <[EMAIL PROTECTED]> wrote:

> One thing I noticed is that "collations" != "case change". This will
> not make it possible to use UPPER/LOWER with the same data on the
> table, but maybe we can work on something in that respect, also.

If the intent is case-insensitive comparisons a la LIKE, note that
"upper" and "lower" operations alone won't suffice, because in some
scripts a single character maps to multiple characters in one
direction.  Unicode defines a "case folding" operation for such uses,
which does full expansion no matter which direction it is
(lower->upper or upper->lower), allowing the result to be handled with
a simple memcmp()-style binary comparison.


Right, an example is the german "straße" -> "STRASSE" (the 'ß' - beta
- character being replaced by "SS" in uppercase, for those who can't
see the mail properly).


It should certainly be possible to cover those operations too, it's
just as nuanced as the collation work :)


Yes. I'll try to think on this too, but not the big priority.


Best regards,
~Nuno Lucas

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-28 Thread Nuno Lucas

On 6/28/07, Jiri Hajek <[EMAIL PROTECTED]> wrote:

> My idea is to implement the UCA collation in SQLite (with the usual
> OMIT_* #ifdef's), using the DUCET table as base, and if people need
> the tailoring part for localized sorting, have it be optional by
> having a "sqlite_collation_data" table with the needed locale data
> included on the database.

That would certainly be great if this is implemented. Note, however,
that it doesn't fully solve the issues described in this thread - i.e.
if you create a DB by some SQLite version and then use it by a newer
version, where some elements were
added/modified in DUCET, indexes of these DBs wouldn't be compatible.
It can be resolved in several ways, e.g. as suggested to have all
DUCET data stored in a special table in SQLite database. It's just a
matter of choosing a well-balanced solution...


There are 2 problems: UCA changes and DUCET (and/or other locale data)
changes. DUCET and locale data is in the database table, so it can
only change by user intervention, meaning it's his fault if done
without rebuilding the affected table index(es).

UCA changes are more problematic, but they are less frequent (it seems
there was a minor change between Unicode 4.0 and 4.1, though).

I don't see any good solution for this other than having an extra
field on the database file (or in the collation data tables) with the
UCA version and advise the user when using an UCA version different
from the one it was created (or last used, as by default no collation
data is needed).

It's very probable that not many users are actually bothered by this
(the algorithm is probably not changing much over time, and probably
not in incompatible ways for most locales).

Maybe we can just make sure "PRAGMA consistency_check" notices if
there is an inconsistency and that VACUUM will fix this.


Btw, even if this is implemented, there is still a need for a
standardization such new collation names. E.g. that new language
neutral collation could be called Unicode or DUCET? And how about
language specific collations? After some thoughts, I'd suggest
something like UNIL_en_AU (where UNIL means Unicode linguistic - i.e.
some characters are properly ignored, given for example by an ordering
of 'con', 'coop', 'co-op') and UNIS_en_AU (where UNIS means Unicode
strings - i.e. special characters aren't ignored, so that above words
would be ordered as 'co-op', 'con', 'coop').


I don't find this particularly important, because the collation name
has to be on the tables, so it can be called "DEFAULT" and have only
data for the "fr_FR" locale on the "default" tables (including the
DUCET base embedded) on embedded devices.

I'm thinking there will be a "reference" database with all locale
data, and it's up to users to use it "as is" or build their own (maybe
just rename locales).

It seems more natural to use the standard C locale names to me (the
usual "pt_PT" and "pt_BR" for  Portuguese/Portugal culture and
Portuguese/Brazil culture) , but I'm open to suggestions when that
problem arise, and I'm sure there are already standards we can follow
in relation to that.

Well, I will probably only have time to actually put words into code
next weekend, so I will say something when I have source code to show.


Best regards,
~Nuno Lucas



Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-




-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-28 Thread Trevor Talbot

On 6/28/07, Jiri Hajek <[EMAIL PROTECTED]> wrote:


Btw, even if this is implemented, there is still a need for a
standardization such new collation names. E.g. that new language
neutral collation could be called Unicode or DUCET? And how about
language specific collations? After some thoughts, I'd suggest
something like UNIL_en_AU (where UNIL means Unicode linguistic - i.e.
some characters are properly ignored, given for example by an ordering
of 'con', 'coop', 'co-op') and UNIS_en_AU (where UNIS means Unicode
strings - i.e. special characters aren't ignored, so that above words
would be ordered as 'co-op', 'con', 'coop').


At least part of the collation name would come from the special
database table, so it may be useful to have some general prefix that
indicates sqlite is using the internal algorithm + table style
mechanism for it.  E.g. "IUC:" for "Included Unicode Collation"
or something.

That's probably not a valid identifier currently, but you get the idea...

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-28 Thread Trevor Talbot

On 6/28/07, Nuno Lucas <[EMAIL PROTECTED]> wrote:


One thing I noticed is that "collations" != "case change". This will
not make it possible to use UPPER/LOWER with the same data on the
table, but maybe we can work on something in that respect, also.


If the intent is case-insensitive comparisons a la LIKE, note that
"upper" and "lower" operations alone won't suffice, because in some
scripts a single character maps to multiple characters in one
direction.  Unicode defines a "case folding" operation for such uses,
which does full expansion no matter which direction it is
(lower->upper or upper->lower), allowing the result to be handled with
a simple memcmp()-style binary comparison.

It should certainly be possible to cover those operations too, it's
just as nuanced as the collation work :)

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-28 Thread Jiri Hajek

My idea is to implement the UCA collation in SQLite (with the usual
OMIT_* #ifdef's), using the DUCET table as base, and if people need
the tailoring part for localized sorting, have it be optional by
having a "sqlite_collation_data" table with the needed locale data
included on the database.


That would certainly be great if this is implemented. Note, however,
that it doesn't fully solve the issues described in this thread - i.e.
if you create a DB by some SQLite version and then use it by a newer
version, where some elements were
added/modified in DUCET, indexes of these DBs wouldn't be compatible.
It can be resolved in several ways, e.g. as suggested to have all
DUCET data stored in a special table in SQLite database. It's just a
matter of choosing a well-balanced solution...

Btw, even if this is implemented, there is still a need for a
standardization such new collation names. E.g. that new language
neutral collation could be called Unicode or DUCET? And how about
language specific collations? After some thoughts, I'd suggest
something like UNIL_en_AU (where UNIL means Unicode linguistic - i.e.
some characters are properly ignored, given for example by an ordering
of 'con', 'coop', 'co-op') and UNIS_en_AU (where UNIS means Unicode
strings - i.e. special characters aren't ignored, so that above words
would be ordered as 'co-op', 'con', 'coop').

Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-28 Thread Nuno Lucas

On 6/27/07, Jiri Hajek <[EMAIL PROTECTED]> wrote:

>  Trevor proposed method doesn't need that, because the collation data
> goes with the database (which is the big advantage to me). And if you
> want to use another collation, you only need to import the collation
> data needed for that locale.

Well, Trevor's proposal sounds good indeed, but realistically, I don't
see any big chances it to be implemented - collations aren't that
simple and maintaining all the related issues without relying on some
external code is probably too much to expect.


Well, MySQL does it since v4.1, and from what I googled (I don't have
the links on this PC), the collation data needed with the adjustments
(tailoring) for the different locales [1] is not that much (and it
doesn't change the algorithm code, only data tables are needed).


From what I have seen, the Unicode Collation Algorithm (UCA) [2] is

already a good base for general ordering, and it seems the European
Ordering Rules (EOR) [3] is similar to UCA with the Default Unicode
Collation Element Table - DUCET, which already is good enough for many
people (including 90% of my needs).

Yesterday it looked like unicode.org was down  and I haven't had the
chance to look at the UCA report (UTS #10) in depth, but I will try to
come up with a proposal for sqlite.

My idea is to implement the UCA collation in SQLite (with the usual
OMIT_* #ifdef's), using the DUCET table as base, and if people need
the tailoring part for localized sorting, have it be optional by
having a "sqlite_collation_data" table with the needed locale data
included on the database.

One thing I noticed is that "collations" != "case change". This will
not make it possible to use UPPER/LOWER with the same data on the
table, but maybe we can work on something in that respect, also.


Regards,
~Nuno Lucas

[1] http://developer.mimer.com/collations/charts/index.tml
[2] http://www.unicode.org/reports/tr10/
[3] http://en.wikipedia.org/wiki/European_Ordering_Rules



Jiri


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-27 Thread Jiri Hajek

> Actually, reading one of the links you posted
> (http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx -
> everybody please read it before continuing in discussion), I got (what
> I think is a great) idea: Let's include version information about
> collation algorithm we are using. So, extending my previous
> suggestion, the SQLite collation title would look like
> 'en_AU_Win_2_1', where the individual parts mean:
>  'en_AU' - obviously a sort order
>  'Win' - that we are basing the ordering on default Windows methods
> (otherwise could be e.g. 'ICU')
>  '2_1' - version string as retrieved from GetNLSVersion() function and
> NLSVERSIONINFO struct (see the link above for details).

I don't see how that is different than the previous difficulty of
having incompatible databases between systems. The only new thing is
we can check it, but the database will have inconsistent behaviour
anyway.


As I wrote, my main intention was to handle incompatibilities among
applications reading a database on a given system - and this would
handle the issue perfectly.


> This way we are absolutely sure that we don't corrupt SQLite database.
> While I'm working on a single system, the version string wouldn't
> change. Then, e.g. if I open SQLite database with 'en_AU_Win_2_1'
> collation on a system NLSVERSIONINFO returns version 2.3, I know I
> have to reindex, change collation title to 'en_AU_Win_2_3' and then
> can safely use the database.

For small databases, the task of re-indexing is not big, but you are
forgetting you can have a database in a shared network folder, used by
PC's in different parts of the world and even different OSs (with
samba/cifs). That's why I like Trevor's idea so much.


You are right in this, my proposal handles some problems, but can't
handle this situation. That said, it at least allows sharing databases
under compatible platforms - which is still an improvement over the
current situation where every application has its own collation and
there's absolutely no compatibility.


 Trevor proposed method doesn't need that, because the collation data
goes with the database (which is the big advantage to me). And if you
want to use another collation, you only need to import the collation
data needed for that locale.


Well, Trevor's proposal sounds good indeed, but realistically, I don't
see any big chances it to be implemented - collations aren't that
simple and maintaining all the related issues without relying on some
external code is probably too much to expect.

Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-27 Thread Nuno Lucas

On 6/27/07, Jiri Hajek <[EMAIL PROTECTED]> wrote:

Actually, reading one of the links you posted
(http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx -
everybody please read it before continuing in discussion), I got (what
I think is a great) idea: Let's include version information about
collation algorithm we are using. So, extending my previous
suggestion, the SQLite collation title would look like
'en_AU_Win_2_1', where the individual parts mean:
 'en_AU' - obviously a sort order
 'Win' - that we are basing the ordering on default Windows methods
(otherwise could be e.g. 'ICU')
 '2_1' - version string as retrieved from GetNLSVersion() function and
NLSVERSIONINFO struct (see the link above for details).


I don't see how that is different than the previous difficulty of
having incompatible databases between systems. The only new thing is
we can check it, but the database will have inconsistent behaviour
anyway.


This way we are absolutely sure that we don't corrupt SQLite database.
While I'm working on a single system, the version string wouldn't
change. Then, e.g. if I open SQLite database with 'en_AU_Win_2_1'
collation on a system NLSVERSIONINFO returns version 2.3, I know I
have to reindex, change collation title to 'en_AU_Win_2_3' and then
can safely use the database.


Trevor proposed method doesn't need that, because the collation data
goes with the database (which is the big advantage to me). And if you
want to use another collation, you only need to import the collation
data needed for that locale.


It seems to me to be a really cross-platform solution and what's best
- without any coding on SQLite side necessary (even though possible to
handle some parts of this proposal internally).


On the articles you have pointed (but I don't have a link right now)
also states that SQL Server and the Jet engine don't use the system
collation functions, but instead have their own integrated collation
system (based on the Windows one, off course) exactly for the same
reason.

For small databases, the task of re-indexing is not big, but you are
forgetting you can have a database in a shared network folder, used by
PC's in different parts of the world and even different OSs (with
samba/cifs). That's why I like Trevor's idea so much.


Best regards,
~Nuno Lucas


Jiri


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-27 Thread Jiri Hajek

After thinking a bit, it occurs to me that there's a compromise for
the Unicode case that might be workable.  The algorithm for collation
is pretty stable, it's just the locale data that's the problem.  If
SQLite understands the algorithm, then locale data can go into special
tables in the database itself.

Applications manipulating the database schema would need to have the
relevant collation data on hand to fill in the database, but other
apps concerned with only the data could operate without any special
knowledge.  This approach keeps the database internally
self-consistent while avoiding platform and versioning issues.


Something like this could probably work, but it still seems to me that
we are assigning SQLite tasks that OS or some other libraries should
handle.

Actually, reading one of the links you posted
(http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx -
everybody please read it before continuing in discussion), I got (what
I think is a great) idea: Let's include version information about
collation algorithm we are using. So, extending my previous
suggestion, the SQLite collation title would look like
'en_AU_Win_2_1', where the individual parts mean:
'en_AU' - obviously a sort order
'Win' - that we are basing the ordering on default Windows methods
(otherwise could be e.g. 'ICU')
'2_1' - version string as retrieved from GetNLSVersion() function and
NLSVERSIONINFO struct (see the link above for details).

This way we are absolutely sure that we don't corrupt SQLite database.
While I'm working on a single system, the version string wouldn't
change. Then, e.g. if I open SQLite database with 'en_AU_Win_2_1'
collation on a system NLSVERSIONINFO returns version 2.3, I know I
have to reindex, change collation title to 'en_AU_Win_2_3' and then
can safely use the database.

It seems to me to be a really cross-platform solution and what's best
- without any coding on SQLite side necessary (even though possible to
handle some parts of this proposal internally).

Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-27 Thread Nuno Lucas

On 6/27/07, Trevor Talbot <[EMAIL PROTECTED]> wrote:

On 6/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> > > So the choices seem to be:
> > >
> > >  (1) Databases that corrupt if you move across platforms.
> > >  (2) A 10MB database engine
> > >  (3) Leave things as they are

> OK.  Here is a crazy idea for consideration:
>
> You know that you can create a custom collating function as a
> DLL or shared library and load it at runtime, right?  This has
> been a capability since version 3.3.7.  Suppose we define a
> special table in the database file that is designed to hold
> DLLs and/or shared libraries.

After thinking a bit, it occurs to me that there's a compromise for
the Unicode case that might be workable.  The algorithm for collation
is pretty stable, it's just the locale data that's the problem.  If
SQLite understands the algorithm, then locale data can go into special
tables in the database itself.

Applications manipulating the database schema would need to have the
relevant collation data on hand to fill in the database, but other
apps concerned with only the data could operate without any special
knowledge.  This approach keeps the database internally
self-consistent while avoiding platform and versioning issues.

It's a thought.


I love your idea and seems the most promising by far.

If we could have it in a way such that people who only use a specific
set of locales (like applications that are only internationalized for
the german, spanish and italian market, for example) could only have
the locale data they need (like ignore hebrew or chinese), I believe
that would be the best of all worlds.

Maybe we could have a database (the one with the 10MB in size) in the
contrib area with all the locale data and people would then import the
locales they need into their own databases on creation.

Indexes would continue to be coherent across platforms and different
locales (with the proper care).

This is good even for embedded devices, as they are usually deployed
on a single locale, and can have only the locale data they need.


Best regards,
~Nuno Lucas

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-27 Thread Trevor Talbot

On 6/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:


> > So the choices seem to be:
> >
> >  (1) Databases that corrupt if you move across platforms.
> >  (2) A 10MB database engine
> >  (3) Leave things as they are



OK.  Here is a crazy idea for consideration:

You know that you can create a custom collating function as a
DLL or shared library and load it at runtime, right?  This has
been a capability since version 3.3.7.  Suppose we define a
special table in the database file that is designed to hold
DLLs and/or shared libraries.


After thinking a bit, it occurs to me that there's a compromise for
the Unicode case that might be workable.  The algorithm for collation
is pretty stable, it's just the locale data that's the problem.  If
SQLite understands the algorithm, then locale data can go into special
tables in the database itself.

Applications manipulating the database schema would need to have the
relevant collation data on hand to fill in the database, but other
apps concerned with only the data could operate without any special
knowledge.  This approach keeps the database internally
self-consistent while avoiding platform and versioning issues.

It's a thought.


On 6/27/07, Joe Wilson <[EMAIL PROTECTED]> wrote:


The solution is obvious:

  http://en.wikipedia.org/wiki/Esperanto


:D

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-27 Thread Trevor Talbot

On 6/27/07, Jiri Hajek <[EMAIL PROTECTED]> wrote:


I tried to think about this in as many details as possible and I'd say
that the original problem that I'd like to solve here (i.e. collation
of Unicode character sets) doesn't require any special complex
handling in SQLite. We really should just define how will individual
collations be named, so that Unicode SQLite databases become portable.



The only argument against this was that individual implementations of
Unicode standard (i.e. mainly internal Windows methods or ICU library)
could differ. However, is it really a problem? I'd say that it isn't.
Unicode specifies exactly how characters should be sorted and so _if_
there really are any differences between Windows and ICU
implementations, they should be considered as a bug and be solved
there. I.e. it isn't anything SQLite should be responsible for.


No, it doesn't.  See _The Unicode Standard 5.0_ chapter 5 section 16
(http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf).  Unicode
Technical Standard #10 (http://www.unicode.org/reports/tr10/)
specifies how collections work, and provides a default table, but note
carefully that the table is versioned and contains incompatible
changes between versions -- see the Modifications section at the end.
It does not deal with locale-specific collation at all.

I'll also refer you to the blog of Michael Kaplan, someone involved in
both the Unicode Consortium and the NLS group at Microsoft:
http://blogs.msdn.com/michkap/search.aspx?q=collation

On platform compatibility:
http://blogs.msdn.com/michkap/archive/2004/11/28/271121.aspx
http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx

What Microsoft does with their own SQL Server:
http://blogs.msdn.com/michkap/archive/2005/03/11/394359.aspx

The same issues apply to anything you try to do with SQLite.

Unicode is nowhere near as simple as people would like it to be, which
is why everyone picks a specific platform with a known state and
sticks to it like glue.  There be dragons here.

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-27 Thread Jiri Hajek

A moments reflection convinces me that this is not a good idea
as stated.  But perhaps it can be the seed for a idea that will
actually work.  Is there some way of adding customizations to
the database file itself, or perhaps to a separate file in a
standard place the SQLite always knows to look, so that
custom enhancements and extensions to SQLite can be accomodated
in standard tools?  Worth thinking about, perhaps


I tried to think about this in as many details as possible and I'd say
that the original problem that I'd like to solve here (i.e. collation
of Unicode character sets) doesn't require any special complex
handling in SQLite. We really should just define how will individual
collations be named, so that Unicode SQLite databases become portable.

The only argument against this was that individual implementations of
Unicode standard (i.e. mainly internal Windows methods or ICU library)
could differ. However, is it really a problem? I'd say that it isn't.
Unicode specifies exactly how characters should be sorted and so _if_
there really are any differences between Windows and ICU
implementations, they should be considered as a bug and be solved
there. I.e. it isn't anything SQLite should be responsible for.

Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-25 Thread Joe Wilson
--- [EMAIL PROTECTED] wrote:
> OK.  Here is a crazy idea for consideration:
> 
> You know that you can create a custom collating function as a
> DLL or shared library and load it at runtime, right?  This has
> been a capability since version 3.3.7.  Suppose we define a
> special table in the database file that is designed to hold
> DLLs and/or shared libraries.  Perhaps something like this:
> 
> CREATE TABLE sqlite_shared_libs(
>architecture TEXT,
>library BLOB
> );
> 
> You can put DLLs or shared libraries for any custom collating 
> functions (also any custom SQL functions or virtual table 
> implementations) into this table and SQLite will load them
> automatically which is starts up.  This allows you to create
> any non-standard database extensions you want and make them
> completely portable to any database editor or other tool that
> wants to access the database.

Christian Werner's SQLiteODBC driver does something similar to
your proposal except it compiles C function extensions on the fly
using TCC:

  http://www.mail-archive.com/sqlite-users@sqlite.org/msg21945.html

  http://sqlite.org/contrib/download/sqlite%2Btcc.tar.gz?get=13

The idea could be generalized to run any arbitrary binary and/or
script to make a shared library.

See also the sqlite_extensions proposal at the bottom of this page:

  http://www.sqlite.org/cvstrac/wiki?p=LoadableExtensions

It takes into account extension library load/unload order as well 
as the name the names of the function entry points.


  

Park yourself in front of a world of choices in alternative vehicles. Visit the 
Yahoo! Auto Green Center.
http://autos.yahoo.com/green_center/ 

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



RE: [sqlite] Unicode collation

2007-06-25 Thread James Dennett

> -Original Message-
> From: Jiri Hajek [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 25, 2007 8:14 AM
> To: sqlite-users@sqlite.org
> Subject: Re: [sqlite] Unicode collation
> 
> > So the choices seem to be:
> >
> >  (1) Databases that corrupt if you move across platforms.
> >  (2) A 10MB database engine
> >  (3) Leave things as they are
> >
> > I prefer to go with (3)
> 
> Yes, based on the facts I agree that (3) is the best way to go - a
> developer can decide whether to use Windows methods only, or be
> cross-platform and use ICU or anything else.
> 
> However, it isn't what I mean. I suggest to publish some kind of a
> standard that database developers would use. I.e. SQLite web would
> define, that if you want to make a Unicode database that can be
> openned by other applications, you should define collation e.g. as
> 'en_AU' for Australian English. This way, if I want let users of my
> application to open DB in e.g. in some database editor, I can use this
> standard and users will be able to open it in many applications
> following the standard.
> 
> So, what I propose aren't actually any changes in SQLite, but rather
> publishing something that would guide all SQLite developers.

I thought that was what was being responded to; publishing such a
recommendation leads to problem (1), because the collation rules will
not match in different environments.

-- James


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-25 Thread drh
"Jiri Hajek" <[EMAIL PROTECTED]> wrote:
> > So the choices seem to be:
> >
> >  (1) Databases that corrupt if you move across platforms.
> >  (2) A 10MB database engine
> >  (3) Leave things as they are
> >
> > I prefer to go with (3)
> 
> Yes, based on the facts I agree that (3) is the best way to go - a
> developer can decide whether to use Windows methods only, or be
> cross-platform and use ICU or anything else.
> 
> However, it isn't what I mean. I suggest to publish some kind of a
> standard that database developers would use. I.e. SQLite web would
> define, that if you want to make a Unicode database that can be
> openned by other applications, you should define collation e.g. as
> 'en_AU' for Australian English. This way, if I want let users of my
> application to open DB in e.g. in some database editor, I can use this
> standard and users will be able to open it in many applications
> following the standard.
> 
> So, what I propose aren't actually any changes in SQLite, but rather
> publishing something that would guide all SQLite developers.
> 

OK.  Here is a crazy idea for consideration:

You know that you can create a custom collating function as a
DLL or shared library and load it at runtime, right?  This has
been a capability since version 3.3.7.  Suppose we define a
special table in the database file that is designed to hold
DLLs and/or shared libraries.  Perhaps something like this:

CREATE TABLE sqlite_shared_libs(
   architecture TEXT,
   library BLOB
);

You can put DLLs or shared libraries for any custom collating 
functions (also any custom SQL functions or virtual table 
implementations) into this table and SQLite will load them
automatically which is starts up.  This allows you to create
any non-standard database extensions you want and make them
completely portable to any database editor or other tool that
wants to access the database.

A moments reflection convinces me that this is not a good idea
as stated.  But perhaps it can be the seed for a idea that will
actually work.  Is there some way of adding customizations to
the database file itself, or perhaps to a separate file in a
standard place the SQLite always knows to look, so that 
custom enhancements and extensions to SQLite can be accomodated
in standard tools?  Worth thinking about, perhaps

--
D. Richard Hipp <[EMAIL PROTECTED]>


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-25 Thread Jiri Hajek

So the choices seem to be:

 (1) Databases that corrupt if you move across platforms.
 (2) A 10MB database engine
 (3) Leave things as they are

I prefer to go with (3)


Yes, based on the facts I agree that (3) is the best way to go - a
developer can decide whether to use Windows methods only, or be
cross-platform and use ICU or anything else.

However, it isn't what I mean. I suggest to publish some kind of a
standard that database developers would use. I.e. SQLite web would
define, that if you want to make a Unicode database that can be
openned by other applications, you should define collation e.g. as
'en_AU' for Australian English. This way, if I want let users of my
application to open DB in e.g. in some database editor, I can use this
standard and users will be able to open it in many applications
following the standard.

So, what I propose aren't actually any changes in SQLite, but rather
publishing something that would guide all SQLite developers.

Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-25 Thread drh
"Jiri Hajek" <[EMAIL PROTECTED]> wrote:
> Ok, drh, could you share your thoughts about this. Could something
> like this, i.e. some kind of recommendation be created, so that all
> applications are consistent in Unicode support?
> 

Dan's observation is correct.  It seems to me that the
collating sequences generated by version editions of Windows
are unlikely to be identical.  Almost certainly the collating
sequence for Windows will be different from ICU.  So if we
were to make any of the collating sequences standard, then
suddenly SQLite databases are no longer cross-platform.

We could build in our own collating functions.  But the
collation tables are huge.  The 250KB SQLite library would
mushroom into the 10MB range.

So the choices seem to be:

  (1) Databases that corrupt if you move across platforms.
  (2) A 10MB database engine
  (3) Leave things as they are

I prefer to go with (3)

--
D. Richard Hipp <[EMAIL PROTECTED]>


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-25 Thread Jiri Hajek

Ok, drh, could you share your thoughts about this. Could something
like this, i.e. some kind of recommendation be created, so that all
applications are consistent in Unicode support?

Thanks,
Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-23 Thread Dan Kennedy
On Sat, 2007-06-23 at 10:56 +0200, Jiri Hajek wrote:
> > The reason is as you've surmised. Not all systems have full unicode
> > support (I'm not sure, but if I had to guess, I would say very few
> > systems do). Including an implementation with SQLite would bloat
> > the library to at least several times it's current size.
> 
> I know really well only Windows where it's only about using
> CompareString() function, i.e. almost no code in SQLite.
> 
> > Another reason is that it's a pretty complex topic. Supporting
> > most European languages would probably be possible without too
> > much trouble, but once you get into Asian and Middle-eastern
> > languages I think it's much harder.
> 
> As far as I know, there wouldn't be any complexity on SQLite's side -
> it's only about calling proper methods (be it CompareString() on
> Windows or ICU methods elsewhere), i.e. again pretty much no code
> needed in SQLite.
> 
> > There is an extension packaged with SQLite sources that uses the
> > ICU library to provide locale dependent collation sequences and
> > case folding. See here for details:
> 
> ICU is nice, but pretty large. Since I develop for Windows, I'd rather
> not distribute it with my application considering that this is alredy
> provided in Windows in reasonable quality.
> 
> Anyway, I guess that the question isn't mainly about how to implement
> this in SQLite, but about the problem that SQLite doesn't define any
> standard how to handle Unicode. Currently, any application that needs
> to work with Unicode data has to define its own collation and name it
> 'tr_TR', 'turkish', 'MyTurkish', or any other way. The result is a big
> mess and no chance of opening SQLite database in other application
> than it was designed for.
>
> So, why don't we (or you - SQLite developers) define how to name
> collations (e.g. that 'tr_TR', 'en_AU', ... standard?) and then every
> database complying this would be perfectly portable.

That's not too bad an idea. One thing to watch out for is that if
the definition of a collation sequence that you have used to create
an index varies even slightly from machine to machine (say from 
win98 to windows vista), you are headed for database corruption.

For that reason I'd be a bit reluctant to encourage people to use
more than one implementation of a named collation sequence. 

Dan.




-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-23 Thread Jiri Hajek

The reason is as you've surmised. Not all systems have full unicode
support (I'm not sure, but if I had to guess, I would say very few
systems do). Including an implementation with SQLite would bloat
the library to at least several times it's current size.


I know really well only Windows where it's only about using
CompareString() function, i.e. almost no code in SQLite.


Another reason is that it's a pretty complex topic. Supporting
most European languages would probably be possible without too
much trouble, but once you get into Asian and Middle-eastern
languages I think it's much harder.


As far as I know, there wouldn't be any complexity on SQLite's side -
it's only about calling proper methods (be it CompareString() on
Windows or ICU methods elsewhere), i.e. again pretty much no code
needed in SQLite.


There is an extension packaged with SQLite sources that uses the
ICU library to provide locale dependent collation sequences and
case folding. See here for details:


ICU is nice, but pretty large. Since I develop for Windows, I'd rather
not distribute it with my application considering that this is alredy
provided in Windows in reasonable quality.

Anyway, I guess that the question isn't mainly about how to implement
this in SQLite, but about the problem that SQLite doesn't define any
standard how to handle Unicode. Currently, any application that needs
to work with Unicode data has to define its own collation and name it
'tr_TR', 'turkish', 'MyTurkish', or any other way. The result is a big
mess and no chance of opening SQLite database in other application
than it was designed for.

So, why don't we (or you - SQLite developers) define how to name
collations (e.g. that 'tr_TR', 'en_AU', ... standard?) and then every
database complying this would be perfectly portable.

Jiri

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Unicode collation

2007-06-23 Thread Dan Kennedy
On Fri, 2007-06-22 at 18:57 +0200, Jiri Hajek wrote:
> Hello,
> 
> I wonder whether there are any plans to include internally proper
> Unicode comparisons? Don't get me wrong, I think that it's great that
> SQLite supports custom collations, there's absolutely no problem to
> handle it in internally for my database, but problem is that if I
> define UNICODE collation, no other application knows about it and so
> users can't open it in any SQLite DB editor.
> 
> Nowadays applications without Unicode support slowly become rare, as I
> see, I'm not the first one asking for this kind of support in SQLite.
> Is there any technical reason why not to include UNICODE and e.g.
> IUNICODE (for case-insensitive comparisons) collations in SQLite? Is
> it because of some systems that don't have (full) Unicode support? In
> such a case, I guess that it could be a compile-time option.

The reason is as you've surmised. Not all systems have full unicode
support (I'm not sure, but if I had to guess, I would say very few
systems do). Including an implementation with SQLite would bloat
the library to at least several times it's current size.

Another reason is that it's a pretty complex topic. Supporting
most European languages would probably be possible without too
much trouble, but once you get into Asian and Middle-eastern
languages I think it's much harder.

There is an extension packaged with SQLite sources that uses the
ICU library to provide locale dependent collation sequences and
case folding. See here for details:

  http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/icu/README.txt

Dan.



-
To unsubscribe, send email to [EMAIL PROTECTED]
-