Re: [sqlite] Proposal for SQLite and non pure ASCII letters

Roger Binns Thu, 12 Mar 2009 00:09:36 -0700

Jean-Christophe Deschamps wrote:
>> You are aware that "standard" SQLite is used in devices with a few
>> kilobytes of memory through workstations and servers with gigabytes of it!
> 
> That's precisely why such approach is interesting!


I apologise for not making myself clear.  Everything that is part of
standard SQLite affects all users of it and imposes testing and
compatibility constraints.  Even people not using it will have to use
OMIT directives and so all the testing will have to be done with that on
and off.

>> As far as I can tell you want some extra "standard" collation sequences
>> and propose shortcuts that will get them mostly right.  And you want
>> someone else to write the code!
> 
> Phew!  It wasn't not my intention to see anyone here go ballistic 
> reading my post.

I wasn't going ballistic - just trying to summarise your request in one
paragraph :-)

>    1) I never pretended at any "standard", just a useful set of features

I quote:

  I'd like to have the group opinion about a feature I would find
  utterly useful in _standard_ SQLite.

>    2) what I said is that it can be made to fit _most_ needs with little
>       requirements in memory and cycles

More accurately it would fit with your needs.  What you are talking
about is collation which means ordering of different strings with
respect to each other.  For SQLite that also means being case sensitive
or insensitive.  Equality also falls out of this.

>    3) I offered (twice) to pay a reasonable fee for such development

That isn't a problem.  It just places you a little further away from the
issues that arise.

> ... if it  implies some branching from mainstream core. 

You won't need to branch - just register as an extension.  It also
doesn't need the SQLite team to implement your proposal - anyone can.

> True, but I think that no extension will ever overload consistanly the 
> macros (!)
> and other deep code (like inside FTS3) involved in REGEX, ORDER, 
> GLOB/LIKE, UPPER...
> Or maybe I overlook something obvious.

You are aware that FTS3 is implemented exactly as an extension?  It
doesn't modify the core of SQLite, just registers as an extension like
anyone else's code could.  Anything FTS3 could do, you can do.

> OTOH, having collation and comparison located in _code_ makes them 
> difficult to port.

They have to be in code at some point.  The default builtin code is very
simplistic, only paying attention to ASCII and just comparing raw bytes
for non-ascii characters.

ICU is also "located in code", but uses many tables to drive what
happens. As far as I can tell from your proposal you would also have
code to do the comparisons and store data to help drive that in the
database somewhere.


> I'm not lobbying,

That is exactly what your first sentence was :-)

> nor asking for good practice guidance. 

The purpose of the "guidance" was to show how to get your proposal to be
a standard part of SQLite.

> I was asking the group their opinion about the usefulness of such feature. 

As described I think it is mostly useless.  Sorting even just for West
European languages is not trivial.  For example some countries have a
different sort order for using names vs in the dictionary.  (This
complexity is part of the reason why ICU is so large.)  As described it
would only get things mostly right.  That is why I suggested documenting
where it gets things wrong - for example grab some random Italian,
Swedish, Flemish etc text and sort using your algorithm and using ICU
and compare the results.  If your algorithm is "wrong" 0.001% of the
time then it is very useful to many people.  If it is "wrong" 40% of the
time then it would only be useful in a limited set of circumstances.

> As far as I can see, it
> would require to
> be part of the core to deliver full power.

I still don't understand that claim.  Both ICU and FTS3 are extensions
and deliver their functionality that way.  Your algorithm could also be
delivered that way.  Once it has demonstrated utility then it would be
far easier to get into the core.

> I understand how difficult it is for some english-only developpers or 
> users, having to
> support code for non-english speakers / writers. I just thought it was 
> fading now.

And some developers have had to work on software that supports users all
over the globe throughout their career and know just how difficult this
stuff is!  And what happens when you get it wrong.

> It doesn't work with "locales" at all.

Errr, yes it does.  You are defining sort order and capitalization that
are to be presented to a user.  That user has a locale.

> It allows the user to declare its own set of characters
> and the way SQLite should handle them for low-level operations.

So the "feature" wouldn't ship with any of the sequences?  Each
developer using the feature would have to come up with their own?  You
won't allow more than one? In case you haven't looked at ICU, it
basically consists of code and data tables which deal with each locale.

> BTW locales are far from perfection. For instance: you have to search 
> text, say an address book in a cellphone, with FTS3 and you know the 
> base may have words or names in a dozen european languages. How would 
> you do?  ICU? Huge and slow, but even then: which "locale" would you use?

When you display stuff to a user, you are always using a locale.  As a
developer you need to be aware and do whatever the right thing is.  For
your example I would have the locale set by the user of the cellphone.
If they say French then I'd use the French collation rules.  If they
said German then I'd use the German rules (which differ for phone books
vs other uses).  The display would end up in a different order.

Note that ICU would get that correct today.  It is only huge and slow
because of all the locales it supports and the idiosyncrasies of each
locale. But it does get the answer right.  Other approximations such as
your proposal would only get the answers mostly right.

> Perhaps the best way is practice: what's the way to find this guy named 
> Éric or is it éric, or Eric, or eric? He lives in MÜNCHEN, München, 
> MUNCHEN, Munchen or ... Munich.

We all agree that it is difficult.  Basically there are 3 solutions:

1 - Do ASCII only, other locales are deliberately/obviously wrong
2 - A small light algorithm that gets some locales somewhat right
3 - Get all locales 100% correct

SQLite ships with #1 and #3.  For #2 it is important just how right (or
wrong) it gets various locales.

The Unicode Collation Algorithm - http://unicode.org/reports/tr10/ -
also tries #2 although you'd probably consider it to much code/data.
The begining of that page also describes many locale dependent issues
just like your example, and why the locale matters.  ICU started out as
the UCA and now has lots more code and data.

> I use more than a strictly defined locale
> "locale" and less than a world locale!

But that only solves your specific situation!  The more specific the
less value it has to anyone else.

> Now is perhaps the time for something else.

We already have two choices - get only ASCII 100% right and get all
locales 100% right (ICU).  Proposing something inbetween should at least
give an idea of accuracy.  For example if you said 99% right for 90% of
Western European locales with only 100 lines of code and 1000 bytes of
data then it is a good proposal.

Roger
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Proposal for SQLite and non pure ASCII letters

Reply via email to