Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread John Machin
On 20/03/2009 1:55 AM, Mail.sqlite wrote:
> Roger:

> 4. Your comments sound to me that you are living in a 7bit ASCII country, do 
> you?

>> Mail.sqlite wrote:
>>> - Please, let us try to bring down the discussion to the intended solution
  - a simple way to define and use a "user defined" collating for 8 bit 
ASCII characters!

FYI there are no such things as "7bit ASCII country" or "8 bit ASCII 
character". ASCII is a 7-bit character set, everywhere.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Jay A. Kreibich
On Thu, Mar 19, 2009 at 12:31:00PM -0700, Noah Hart scratched on the wall:

> Rather my point is that it would be of benefit if SQLite would have
> some built in mechanism for a rule-based collation.

  SQLite *does* have a mechanism for "rule-based" or any-other based
  collations:  User defined collations.  They work the exact same way
  built-in collations work.  The API is simple and straight forward.
  If you want to show the SQLite community this is a Good Idea, there
  is one simple, easy, fast solution: stop writing posts and start
  writing code.

  I also don't buy the black and white argument that this needs to go
  into the core or it isn't worth doing.  FTS wasn't in the core
  distribution for ages.  Neither was the ICU stuff.  They still aren't
  on by default...  Yet plenty of people were able to do great work with
  it and -- most importantly -- prove their stability, worth, and
  limitations so that they were accepted by the community and integrated
  into the core distribution.
  
  The truth is dynamic modules work on just about every major platform
  and are trivial to write.  Nearly anything that will allow you to
  execute a raw SQL command will let you attach a module.  If you don't
  like modules, a custom version of SQLite or a SQLite DLL/so/dylib
  is trivial to compile.  Even if your gee-wiz 3D GUI editor might not
  work out of the box, that's not the point.  No matter how strongly
  anyone feels about this, there is a flat zero chance of it going into
  the core distribution until someone writes some code.
  
  Get the code working, be able to show that it works and has worth,
  then come talk.  Even if you believe it must be in the core, you
  still need working code.  So let's see some.

  We can sit here and argue and bicker until the cows come home, or
  someone can sit down, write some code and show/prove that the
  extension is (or isn't!) useful and productive.  But until someone
  produces working code, all we have is a lot of hot air on **both**
  sides of the table.
 

  To quote the Internet Engineering Task Force:

  "We reject: Kings, Presidents, and voting.
   We believe in: rough consensus and running code."


  Show us the code.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs.  We have
 a protractor."   "I'll go home and see if I can scrounge up a ruler
 and a piece of string."  --from Anathem by Neal Stephenson
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Noah Hart
Nico wrote:

>> Number 1, the database is no longer portable. The only solution to
>> this is to include the functionality in the core.

>Yes but, there is no single Unicode collation.  Collation is
>language-specific, even when using Unicode.  Thus you're asking that
>SQLite3 have a plethora of built-in Unicode collations.

I do understand the issues, but I think you misunderstand my intent.
I'm not asking that SQLite3 have any built-in Unicode collations.
Rather my point is that it would be of benefit if SQLite would have 
some built in mechanism for a rule-based collation.

>> more general solution would be to design it around a sqlite_collation
>> master table in the database. An application developer (not the
SQLite
>> team) would be responsible to define and populate their "user
defined"
>> collation.

>It's more complex than you think.  You need to keep Unicode
>normalization forms in mind and you need to deal with decomposed
>characters no matter what (since not all future additions to Unicode
>will include pre-composed forms, and NFC is closed to new pre-composed
>forms anyways), which means multi-codepoint sequences need to be
>accounted for in the collation.  You'd very quickly realize that it'd
be
>even simpler for you if SQLite3 just had built-in collations for all
the
>relevant languages.  And once more SQLite3 would no longer be light.
>Perhaps when built with ICU SQLite3 could make it trivial to load any
of
>those collations.

I'm not suggesting that it is not a complex problem.  

SQLite does a very nice job of taking SQL statements and decomposing 
them into a form that a byte engine can process.  I feel that the 
collations problem could also be addressed in this way with a rules 
based table along with some underlying data that a application developer

(not the SQLite development team) could populate.  

If the "user defined" collation wouldn't work for a particular 
application, then it would not need to be used at all.

If Richard feels that this is something that has no value, I'm more 
than willing to drop the whole idea.  

Regards,

Noah



CONFIDENTIALITY NOTICE: 
This message may contain confidential and/or privileged information. If you are 
not the addressee or authorized to receive this for the addressee, you must not 
use, copy, disclose, or take any action based on this message or any 
information herein. If you have received this message in error, please advise 
the sender immediately by reply e-mail and delete this message. Thank you for 
your cooperation.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Nicolas Williams
On Thu, Mar 19, 2009 at 09:52:55AM -0700, Noah Hart wrote:
> I've been reading and thinking about this topic for a while, and would
> like to add my thoughts.
> 
> I realize that we don't "vote" on features, but I feel that this type
> of idea has merit.
> 
> It is true, that SQLite has user defined collations, and a extension
> could be registered, but the problem with that is twofold:
> 
> Number 1, the database is no longer portable. The only solution to
> this is to include the functionality in the core.

Yes but, there is no single Unicode collation.  Collation is
language-specific, even when using Unicode.  Thus you're asking that
SQLite3 have a plethora of built-in Unicode collations.

And you'll probably want Unicode strings normalized for indexing and
comparison.  And...

And SQLite3 would no longer be light.  You can add Unicode collations
using the user-defined collation function and whatever Unicode collation
implementation you might have (e.g., ICU).

> Number 2, your platform may not support the sqlite3_create_collation
> interface. For example, Firefox now includes SQLite. Unfortunately,
> while Firefox supports user defined functions, their implementation
> does not support user defined collations.  

I'd call that a bug in Firefox.

> Someone commented that the US lives in a 7-bit world.  

But that's not true.  Even people who only read and write English can
barely get by with just US-ASCII (if nothing else a lot of webpages
would display as so many question marks if the browser didn't support
anything other than US-ASCII).  And there are plenty of multi-lingual
people in the U.S.

> This means that the other 6 billion people on the planet do not.

There are lots of non-Unicode character and code sets.  The rest of the
world is not necessarily in a better position than the English-speaking
world.  Unicode is a solution, and the best one at that.

> This creates a real problem for me.  I am writing a foreign language
> Firefox extension, and the issue of sorting is critical, since Firefox
> uses Unicode sorting, which does not "sort" (based on my rules)
> correctly.   This means I have no way to correct the sorting, except
> in the display routines.
> 
> That being said, I would not limit this feature to 8bit locales.  A

8-bit is so 1980s :)

> more general solution would be to design it around a sqlite_collation
> master table in the database. An application developer (not the SQLite
> team) would be responsible to define and populate their "user defined"
> collation.

It's more complex than you think.  You need to keep Unicode
normalization forms in mind and you need to deal with decomposed
characters no matter what (since not all future additions to Unicode
will include pre-composed forms, and NFC is closed to new pre-composed
forms anyways), which means multi-codepoint sequences need to be
accounted for in the collation.  You'd very quickly realize that it'd be
even simpler for you if SQLite3 just had built-in collations for all the
relevant languages.  And once more SQLite3 would no longer be light.
Perhaps when built with ICU SQLite3 could make it trivial to load any of
those collations.

Nico
-- 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Mail.sqlite wrote:
> 4. Your comments sound to me that you are living in a 7bit ASCII country, do 
> you?

I am British but currently happen to live in the US, and have lived in
every hemisphere whether you cut the earth horizontally or vertically.
My high school had students of 74 different nationalities.  In my
career, every single piece of software I have worked on had to work at a
*minimum* in the UK, USA (US English is different than UK English),
Europe (at least France & Germany, often Italy) as well as in Japanese.
 Those various pieces of software have run on embedded environments,
mini-computers, DOS, various flavours of Windows and a really wide
selection of Unix systems.

My repeated point throughout this is that I have no problem with taking
shortcuts to get better performance/conserve memory but that you should
be aware of how wrong the answers are.  And if proposing it as core
SQLite functionality then it really isn't fair unless the limitations
are well documented since other programmers using it won't be able to
tell how good the results are across all the different locales.

As an analogy I could come up with something that speeds up SQLite
floating point significantly but reduces accuracy of answers.  But
unless I documented how much the accuracy was reduced it would not be a
good idea for other developers to use it.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknCjWsACgkQmOOfHg372QS8zQCfVpXTHEWROfRsg4HiXzer4WK0
dz8AoKQhxwSpHMeV9xIR02XZaPlgQ/lr
=u9Xi
-END PGP SIGNATURE-
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Noah Hart
Igor Tandetnik writes:

>This is not quite true. You say custom functions are supported: then you can 
>do ORDER BY sortkey(textField), >with a suitably defined sortkey() function 
>(see strxfrm, LCMapString). You can't however build indexes using >such a 
>function, something you can do with a collation.

True, that is how I current accomplish it, but it is very inefficient, since 
indexes cannot be used, and a javascript function end up being called for the 
row by row comparison.

>Sorting in many locales is not as simple as suitably ordering individual 
>Unicode characters. E.g. in German >phonebook order, letter ö (small o with 
>umlaut aka diaresis) sorts as if it were two letters oe, that is od < ö >< of. 
>In French, strings are compared ignoring diacritics first, then ties are 
>broken by considering diacritics >right-to-left. In Spanish traditional sort, 
>a pair ch sorts as if it were a single letter between c and d. Even >in 
>English, you would often want to sort co-op and coop, or cant and can't, in 
>such a way that they are kept >together.

>Consider also things like combinig diacritics.

Again I agree.  In fact, with my application, I am not only dealing with 
UTF-16, but also there is the issue of combined letter forms.

This is not a trivial topic, and I don't have a proposed solution.  However, I 
feel this thread is worth pursuing.  There are a lot of smart people on this 
list and who use SQLite.

If the SQLite developers would be willing to open a discussion about embedding 
this functionality, maybe some brainstorming could come up with an acceptable 
solution.


Regards, Noah 






CONFIDENTIALITY NOTICE: 
This message may contain confidential and/or privileged information. If you are 
not the addressee or authorized to receive this for the addressee, you must not 
use, copy, disclose, or take any action based on this message or any 
information herein. If you have received this message in error, please advise 
the sender immediately by reply e-mail and delete this message. Thank you for 
your cooperation.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Igor Tandetnik
Noah Hart  wrote:
> Number 2, your platform may not support the sqlite3_create_collation
> interface. For example, Firefox now includes SQLite. Unfortunately,
> while Firefox supports user defined functions, their implementation
> does not support user defined collations.
>
> Someone commented that the US lives in a 7-bit world.
> This means that the other 6 billion people on the planet do not.
>
> This creates a real problem for me.  I am writing a foreign language
> Firefox extension, and the issue of sorting is critical, since
> Firefox uses Unicode sorting, which does not "sort" (based on my
> rules) correctly.   This means I have no way to correct the sorting,
> except in the display routines.

This is not quite true. You say custom functions are supported: then you 
can do ORDER BY sortkey(textField), with a suitably defined sortkey() 
function (see strxfrm, LCMapString). You can't however build indexes 
using such a function, something you can do with a collation.

> That being said, I would not limit this feature to 8bit locales.  A
> more general solution would be to design it around a sqlite_collation
> master table in the database. An application developer (not the
> SQLite team) would be responsible to define and populate their "user
> defined" collation.

Sorting in many locales is not as simple as suitably ordering individual 
Unicode characters. E.g. in German phonebook order, letter ö (small o 
with umlaut aka diaresis) sorts as if it were two letters oe, that is od 
< ö < of. In French, strings are compared ignoring diacritics first, 
then ties are broken by considering diacritics right-to-left. In Spanish 
traditional sort, a pair ch sorts as if it were a single letter between 
c and d. Even in English, you would often want to sort co-op and coop, 
or cant and can't, in such a way that they are kept together.

Consider also things like combinig diacritics.

Igor Tandetnik 



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Noah Hart
I've been reading and thinking about this topic for a while, and would like to 
add my thoughts.

I realize that we don't "vote" on features, but I feel that this type of idea 
has merit.

It is true, that SQLite has user defined collations, and a extension could be 
registered, but the problem with that is twofold:

Number 1, the database is no longer portable. The only solution to this is to 
include the functionality in the core.

Number 2, your platform may not support the sqlite3_create_collation interface. 
For example, Firefox now includes SQLite. Unfortunately, while Firefox supports 
user defined functions, their implementation does not support user defined 
collations.  

Someone commented that the US lives in a 7-bit world.  
This means that the other 6 billion people on the planet do not.

This creates a real problem for me.  I am writing a foreign language Firefox 
extension, and the issue of sorting is critical, since Firefox uses Unicode 
sorting, which does not "sort" (based on my rules) correctly.   This means I 
have no way to correct the sorting, except in the display routines.

That being said, I would not limit this feature to 8bit locales.  A more 
general solution would be to design it around a sqlite_collation master table 
in the database. An application developer (not the SQLite team) would be 
responsible to define and populate their "user defined" collation.

True, if I call my "user defined" collation sequence "COLLATE_PN", someone else 
might use the same identifier. Their lookup table even be different than the 
"COLLATE_PN" I am using.  However, I feel that is a deployment problem, not a 
development issue.  

It is unlikely that in a single SQLite database, we would have 2 user tables, 
with the same "user defined" collation identifier, but differing collation 
lookups.



I would encourage the developers to at least consider such a feature, after 
all, they did recently add the built-in RTRIM collation.


Regards,

Noah Hart


-Original Message-
- Please, let us try to bring down the discussion to the intended solution - a 
simple way to define and use a "user defined" collating for 8 bit ASCII 
characters! 
As said before, the proposal doesn't rely on locales. If a user needs a german 
collating sequence with sort order for phone-book, dictionary or german upper 
case, it's up to the user to supply a simple 256 byte string with the 
wanted/needed sort order for that index. It could be beneficial to all users 
with the need for special sorting requirements and almost no impact to cpu 
cycles, even on small systems. 

There sould be many users with 8bit ASCII locales requirements that would love 
such an extension.
 



CONFIDENTIALITY NOTICE: 
This message may contain confidential and/or privileged information. If you are 
not the addressee or authorized to receive this for the addressee, you must not 
use, copy, disclose, or take any action based on this message or any 
information herein. If you have received this message in error, please advise 
the sender immediately by reply e-mail and delete this message. Thank you for 
your cooperation.


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-19 Thread Mail.sqlite

Roger:
1. Sorry, I don’t want to screw up somebody with my post
2. It was not my proposal
3. Thank you for the information that I may program/implement whatever I want
4. Your comments sound to me that you are living in a 7bit ASCII country, do 
you?

George
 

> -Ursprüngliche Nachricht- 
> Von: Roger Binns  
> An: General Discussion of SQLite Database  
> Datum: 18-03-2009 19:40 
> Betreff: Re: [sqlite] Proposal for SQLite and non pure ASCII letters 
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Mail.sqlite wrote:
> > - Please, let us try to bring down the discussion to the intended solution 
> > - a simple way to define and use a "user defined" collating for 8 bit ASCII 
> > characters! 
> 
> You don't need anyone's permission - go ahead and implement (or pay
> someone to implement) whatever you want.  Your proposal can be
> implemented as a SQLite extension - there is no need to modify SQLite
> itself.
> 
> On the other hand if you want it to become a standard part of SQLite
> then it isn't unreasonable to expect some documentation as to how
> accurate and useful it actually is.
> 
> Roger
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.9 (GNU/Linux)
> 
> iEYEARECAAYFAknBP/oACgkQmOOfHg372QSwaACeNXBtbCJsm/RvKTBs70HdOXYI
> R84AoMo4qgA6oDwfBzdobQLwW0af1z/p
> =/CRf
> -END PGP SIGNATURE-
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-18 Thread Alexey Pechnikov
Hello!

On Wednesday 18 March 2009 21:51:10 Roger Binns wrote:
> > Tcl, Python and other langs have different unicode implementations. The
> > realizations are more simple than ICU library but millions of
> > applications are using these. I'm will glad to see Tcl/Python/etc.
> > unicode implementation in SQLite.
>
> Speaking for Python, you can trivially hook the Python unicode support
> into SQLite if you use pysqlite or APSW.  However if you think the
> builtin Python Unicode support is good then you are mistaken :-)  See
> http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python

Yes, python and tcl unicode implementation is not good for me and I'm prefer 
version coded by Ioannis. But any implamentation may be better than no unicode 
support. ICU for small devices is nonsense. And ICU compilation is not trivial 
for some platforms.

> Python uses UTF16 or UTF32 internally for Unicode (a compile time
> option).  It isn't unreasonable to expect developers who are making
> heavy use of ICU like functionality (ie enough for performance to be
> noticeable) to set the default database encoding to UTF16 which means no
> encoding conversions while using ICU.  There may be other such "fast
> path" optimizations to improve performance when using ICU.

But it's not enough for many causes. SQLite has native encodings UTF8 and 
UTF16 and so unicode extension must supporting these too.

Did you know good and small unicode supporting library? I did try to find this 
in debian linux but I didn't find.

Best regards.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-18 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Alexey Pechnikov wrote:
> Tcl, Python and other langs have different unicode implementations. The 
> realizations are more simple than ICU library but millions of applications 
> are 
> using these. I'm will glad to see Tcl/Python/etc. unicode implementation in 
> SQLite.

Speaking for Python, you can trivially hook the Python unicode support
into SQLite if you use pysqlite or APSW.  However if you think the
builtin Python Unicode support is good then you are mistaken :-)  See
http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python

> And I think we must have different Unicode implementations for UTF8 and UTF16 
> by performance reasons. ICU works with UTF16 and isn't good for UTF8 
> databases. Tcl has native encoding UTF16 too.

Python uses UTF16 or UTF32 internally for Unicode (a compile time
option).  It isn't unreasonable to expect developers who are making
heavy use of ICU like functionality (ie enough for performance to be
noticeable) to set the default database encoding to UTF16 which means no
encoding conversions while using ICU.  There may be other such "fast
path" optimizations to improve performance when using ICU.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknBQpoACgkQmOOfHg372QQmaQCfYxB/5SR9oxwR/0dCyimRpDvV
TYsAoMNWoT9xG4+cH3lehMmems+/chK7
=hroj
-END PGP SIGNATURE-
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-18 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Mail.sqlite wrote:
> - Please, let us try to bring down the discussion to the intended solution - 
> a simple way to define and use a "user defined" collating for 8 bit ASCII 
> characters! 

You don't need anyone's permission - go ahead and implement (or pay
someone to implement) whatever you want.  Your proposal can be
implemented as a SQLite extension - there is no need to modify SQLite
itself.

On the other hand if you want it to become a standard part of SQLite
then it isn't unreasonable to expect some documentation as to how
accurate and useful it actually is.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknBP/oACgkQmOOfHg372QSwaACeNXBtbCJsm/RvKTBs70HdOXYI
R84AoMo4qgA6oDwfBzdobQLwW0af1z/p
=/CRf
-END PGP SIGNATURE-
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-18 Thread Alexey Pechnikov
Hello!

> From the German example, you can't even do that (name order is different
> than dictionary order).  I think we are agreed that the default SQLite
> implementation gets ASCII right and makes no attempt to deal
> specifically with non-ASCII locales.  The ICU extension gets all the
> locales as right as possible which is why it is huge and slow.
>
> If I took all the text from a leading newspaper in each locale, how well
> would it do.  Would it deal correctly with the name of the prime
> minister or the capital city?
>
> An alternate approach would be working with the ICU folks to improve the
> size and performance of their library.  For example the code could be
> refactored to have fast paths for the most common conversions to improve
> performance, or be able to omit various lesser used locales to improve
> size.

Tcl, Python and other langs have different unicode implementations. The 
realizations are more simple than ICU library but millions of applications are 
using these. I'm will glad to see Tcl/Python/etc. unicode implementation in 
SQLite. I don't need to have text representation for digits and other features 
from ICU. I did try to create collations from tcl but it's more slow than ICU!

And I think we must have different Unicode implementations for UTF8 and UTF16 
by performance reasons. ICU works with UTF16 and isn't good for UTF8 
databases. Tcl has native encoding UTF16 too.

Best regards.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-18 Thread Mail.sqlite

- Please, let us try to bring down the discussion to the intended solution - a 
simple way to define and use a "user defined" collating for 8 bit ASCII 
characters! 
As said before, the proposal doesn't rely on locales. If a user needs a german 
collating sequence with sort order for phone-book, dictionary or german upper 
case, it's up to the user to supply a simple 256 byte string with the 
wanted/needed sort order for that index. It could be beneficial to all users 
with the need for special sorting requirements and almost no impact to cpu 
cycles, even on small systems. 

There sould be many users with 8bit ASCII locales requirements that would love 
such an extension.
 
George

> -Ursprüngliche Nachricht- 
> Von: Roger Binns  
> An: General Discussion of SQLite Database  
> Datum: 17-03-2009 23:55 
> Betreff: Re: [sqlite] Proposal for SQLite and non pure ASCII letters 
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> > I'm know. But you can implement locale-dependent version for single 
> > language 
> > only. 
> 
> - From the German example, you can't even do that (name order is different
> than dictionary order).  I think we are agreed that the default SQLite
> implementation gets ASCII right and makes no attempt to deal
> specifically with non-ASCII locales.  The ICU extension gets all the
> locales as right as possible which is why it is huge and slow.
> 
> So that leaves a middle ground of somewhat right but being
> lighter-weight.  The problem with having that as a default part of
> SQLite is that it will be giving the wrong answers, but rarely will
> developers realise just how wrong.  And there are various tradeoffs that
> can be made between size/performance and correctness.
> 
> Consequently it should be documented just where each implementation
> stands.  The one you linked to is nice in that it documents in the code
> exactly how much bigger things become and you documented it being 4x
> faster than ICU.  But what isn't documented is how accurate it is.
> 
> If I took all the text from a leading newspaper in each locale, how well
> would it do.  Would it deal correctly with the name of the prime
> minister or the capital city?
> 
> An alternate approach would be working with the ICU folks to improve the
> size and performance of their library.  For example the code could be
> refactored to have fast paths for the most common conversions to improve
> performance, or be able to omit various lesser used locales to improve size.
> 
> Roger
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.9 (GNU/Linux)
> 
> iEYEARECAAYFAknAKkkACgkQmOOfHg372QT++QCfdWoE0nW5Cu2MbuZKzI49+hlx
> PSMAoLuN79Zh3dcHxKxS1L/QJOCGEpH8
> =kWCI
> -END PGP SIGNATURE-
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-17 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

> I'm know. But you can implement locale-dependent version for single language 
> only. 

- From the German example, you can't even do that (name order is different
than dictionary order).  I think we are agreed that the default SQLite
implementation gets ASCII right and makes no attempt to deal
specifically with non-ASCII locales.  The ICU extension gets all the
locales as right as possible which is why it is huge and slow.

So that leaves a middle ground of somewhat right but being
lighter-weight.  The problem with having that as a default part of
SQLite is that it will be giving the wrong answers, but rarely will
developers realise just how wrong.  And there are various tradeoffs that
can be made between size/performance and correctness.

Consequently it should be documented just where each implementation
stands.  The one you linked to is nice in that it documents in the code
exactly how much bigger things become and you documented it being 4x
faster than ICU.  But what isn't documented is how accurate it is.

If I took all the text from a leading newspaper in each locale, how well
would it do.  Would it deal correctly with the name of the prime
minister or the capital city?

An alternate approach would be working with the ICU folks to improve the
size and performance of their library.  For example the code could be
refactored to have fast paths for the most common conversions to improve
performance, or be able to omit various lesser used locales to improve size.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknAKkkACgkQmOOfHg372QT++QCfdWoE0nW5Cu2MbuZKzI49+hlx
PSMAoLuN79Zh3dcHxKxS1L/QJOCGEpH8
=kWCI
-END PGP SIGNATURE-
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-17 Thread Alexey Pechnikov
Hello!

On Tuesday 17 March 2009 22:00:29 Roger Binns wrote:
> > You can find locale-independant Unicode extension here:
> > http://ioannis.mpsounds.net/blog/2007/12/19/sqlite-native-unicode-like-
> > support/
> >
> > This work fine for me and about 4x faster than ICU.
>
> That code should be tidied up a bit.  It has copied several functions
> from the SQLite source, and left them named using a sqlite3 prefix so it
> will clash if included in the same file as the amalgamation.

I did fix it by simple sed script. So I did compile SQLite source tree with 
this module. With SQLite 3.6.10 the build did pass all tests.

> The final results will be significantly better than the original
> proposal here, but since it is locale independent they will not be
> accurate for all locales.  For example German uses a different sort
> order for names (eg phone book) vs dictionary, Swedish uses a different
> sort order for accented characters than German etc, and so any locale
> independent sorting will not be 100% correct.

I'm know. But you can implement locale-dependent version for single language 
only. I think we need locale-independent version and versions for all locales 
in SQLite source tree. May be table-driven locales support will be good. ICU 
library is huge and very slow. It's problem that SQLite doesn't support 
unicode by light-weight extension. I did write to sqlite-dev but without any 
effect. I'm very sorry because unicode is "must have" feature now.

Best regards.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-17 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Alexey Pechnikov wrote:
> You can find locale-independant Unicode extension here:
> http://ioannis.mpsounds.net/blog/2007/12/19/sqlite-native-unicode-like-
> support/
> 
> This work fine for me and about 4x faster than ICU.

That code should be tidied up a bit.  It has copied several functions
from the SQLite source, and left them named using a sqlite3 prefix so it
will clash if included in the same file as the amalgamation.

The final results will be significantly better than the original
proposal here, but since it is locale independent they will not be
accurate for all locales.  For example German uses a different sort
order for names (eg phone book) vs dictionary, Swedish uses a different
sort order for accented characters than German etc, and so any locale
independent sorting will not be 100% correct.

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkm/80cACgkQmOOfHg372QToDwCgo3j7m6FV9Sdhn423uqL6I/q8
vEMAnR9bblsCe0LiRwSLyoFM/P4M3/sV
=nK/M
-END PGP SIGNATURE-
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-17 Thread Alexey Pechnikov
Hello!

On Thursday 12 March 2009 10:09:28 Roger Binns wrote:
> > I was asking the group their opinion about the usefulness of such
> > feature.
>
> As described I think it is mostly useless.  Sorting even just for West
> European languages is not trivial.  For example some countries have a
> different sort order for using names vs in the dictionary.  (This
> complexity is part of the reason why ICU is so large.)  As described it
> would only get things mostly right.  That is why I suggested documenting
> where it gets things wrong - for example grab some random Italian,
> Swedish, Flemish etc text and sort using your algorithm and using ICU
> and compare the results.  If your algorithm is "wrong" 0.001% of the
> time then it is very useful to many people.  If it is "wrong" 40% of the
> time then it would only be useful in a limited set of circumstances

You can find locale-independant Unicode extension here:
http://ioannis.mpsounds.net/blog/2007/12/19/sqlite-native-unicode-like-
support/

This work fine for me and about 4x faster than ICU.

Best regards.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-13 Thread Olaf Schmidt

"Jean-Christophe Deschamps" 
schrieb im Newsbeitrag news:7.0.1.0.2.20090312013713.0262b...@q-e-d.org...

> Perhaps the best way is practice: what's the way to find
> this guy named Éric or is it éric, or Eric, or eric?
> He lives in MÜNCHEN, München, MUNCHEN,
> Munchen or ... Munich.

IMO that's not that much a locale-problem - for these kind of
"unsharp-queries" one can use (but then usually requiring a
fulltable-scan) appropriate userdefined functions like e.g.
a Ratcliff/Obershelp-Algo or Metaphone or something
similar.

Here comes a short code-snippet, what my results are
regarding your example-case from above, using the
Ratcliff-Implementation of my VB-Wrapper.


First the results for different search-terms:
(ID, City, Name is the Tbl-Content - RcCity, RcName
 and RcSum are calculated Ratcliff-Results [0 - 100])

for City="munchen" and Name="eric"

ID  City  Name  RcCityRcNameRcSum
 2  Munchen   Eric  100   100   200
 3  MUNCHEN   eric  100   100   200
 1  München   Erich  7967   146
 4  Munic Éric   4596   141
 9  MoskauSergej 272855
 8  LondonPaul   272148
 6  Rome  Milan  301444
 7  MadridJose   112132
10  MoscowAlexey 111021
 5  Paris Serge   01414

for City="moscow" and Name="sergey"

ID  City  Name  RcCityRcNameRcSum
 9  MoskauSergej 50   87137
10  MoscowAlexey100   36136
 5  Paris Serge  14   94108
 7  MadridJose   17   32 49
 4  Munic Éric   32   10 42
 8  LondonPaul   330 33
 1  München   Erich   9   14 23
 2  Munchen   Eric   11   10 21
 3  MUNCHEN   eric   11   10 21
 6  Rome  Milan  120 12

Ok, here comes the used data and the query-example:
(sorry, just a Copy'nPaste from a small test-project -
 no C-Code - just that you get the idea, how the queries
 could look like and what the small test-set was)

Set Cnn = New cConnection
Cnn.CreateNewDB 'without Param creates an InMemory-DB

Cnn.Execute "Create Table T(ID Integer Primary Key, City Text, Name Text)"
'first a few similar sounding inserts
Cnn.Execute "Insert Into T(City, Name) Values('München', 'Erich')"
Cnn.Execute "Insert Into T(City, Name) Values('Munchen', 'Eric')"
Cnn.Execute "Insert Into T(City, Name) Values('MUNCHEN', 'eric')"
Cnn.Execute "Insert Into T(City, Name) Values('Munic', 'Éric')"

'and a few different ones
Cnn.Execute "Insert Into T(City, Name) Values('Paris', 'Serge')"
Cnn.Execute "Insert Into T(City, Name) Values('Rome', 'Milan')"
Cnn.Execute "Insert Into T(City, Name) Values('Madrid', 'Jose')"
Cnn.Execute "Insert Into T(City, Name) Values('London', 'Paul')"
Cnn.Execute "Insert Into T(City, Name) Values('Moskau', 'Sergej')"
Cnn.Execute "Insert Into T(City, Name) Values('Moscow', 'Alexey')"

'now the Ratcliff-based search
Dim SQL As String, Cmd As cSelectCommand
'define a Select-Command
SQL = "Select *, (RcCity + RcName) As RcSum From" & _
 " (Select *" & _
 ", Ratcliff(City,@City) As RcCity" & _
 ", Ratcliff(Name,@Name) As RcName" & _
 "  From T Order By (RcCity + RcName) Desc Limit 10)"
Set Cmd = Cnn.CreateSelectCommand(SQL)

'let's set the two Text-Params on the Cmd-Object and Execute
Cmd.SetText Cmd!City, "munchen"
Cmd.SetText Cmd!Name, "eric"

'  just another Param-Set
'  Cmd.SetText Cmd!City, "moscow"
'  Cmd.SetText Cmd!Name, "sergey"

Set Rs = Cmd.Execute()

'dump the results from the Rs-Content
Dim Fld As cField
For Each Fld In Rs.Fields 'enumerate Field-Names
  Debug.Print Fld.Name,
Next Fld
Debug.Print
Do Until Rs.EOF
  For Each Fld In Rs.Fields  'enumerate Field-Values
Debug.Print Fld.Value,
  Next Fld
  Debug.Print
  Rs.MoveNext
Loop

Olaf



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-12 Thread Roger Binns
Jean-Christophe Deschamps wrote:
>> You are aware that "standard" SQLite is used in devices with a few
>> kilobytes of memory through workstations and servers with gigabytes of it!
> 
> That's precisely why such approach is interesting!

I apologise for not making myself clear.  Everything that is part of
standard SQLite affects all users of it and imposes testing and
compatibility constraints.  Even people not using it will have to use
OMIT directives and so all the testing will have to be done with that on
and off.

>> As far as I can tell you want some extra "standard" collation sequences
>> and propose shortcuts that will get them mostly right.  And you want
>> someone else to write the code!
> 
> Phew!  It wasn't not my intention to see anyone here go ballistic 
> reading my post.

I wasn't going ballistic - just trying to summarise your request in one
paragraph :-)

>1) I never pretended at any "standard", just a useful set of features

I quote:

  I'd like to have the group opinion about a feature I would find
  utterly useful in _standard_ SQLite.

>2) what I said is that it can be made to fit _most_ needs with little
>   requirements in memory and cycles

More accurately it would fit with your needs.  What you are talking
about is collation which means ordering of different strings with
respect to each other.  For SQLite that also means being case sensitive
or insensitive.  Equality also falls out of this.

>3) I offered (twice) to pay a reasonable fee for such development

That isn't a problem.  It just places you a little further away from the
issues that arise.

> ... if it  implies some branching from mainstream core. 

You won't need to branch - just register as an extension.  It also
doesn't need the SQLite team to implement your proposal - anyone can.

> True, but I think that no extension will ever overload consistanly the 
> macros (!)
> and other deep code (like inside FTS3) involved in REGEX, ORDER, 
> GLOB/LIKE, UPPER...
> Or maybe I overlook something obvious.

You are aware that FTS3 is implemented exactly as an extension?  It
doesn't modify the core of SQLite, just registers as an extension like
anyone else's code could.  Anything FTS3 could do, you can do.

> OTOH, having collation and comparison located in _code_ makes them 
> difficult to port.

They have to be in code at some point.  The default builtin code is very
simplistic, only paying attention to ASCII and just comparing raw bytes
for non-ascii characters.

ICU is also "located in code", but uses many tables to drive what
happens. As far as I can tell from your proposal you would also have
code to do the comparisons and store data to help drive that in the
database somewhere.


> I'm not lobbying,

That is exactly what your first sentence was :-)

> nor asking for good practice guidance. 

The purpose of the "guidance" was to show how to get your proposal to be
a standard part of SQLite.

> I was asking the group their opinion about the usefulness of such feature. 

As described I think it is mostly useless.  Sorting even just for West
European languages is not trivial.  For example some countries have a
different sort order for using names vs in the dictionary.  (This
complexity is part of the reason why ICU is so large.)  As described it
would only get things mostly right.  That is why I suggested documenting
where it gets things wrong - for example grab some random Italian,
Swedish, Flemish etc text and sort using your algorithm and using ICU
and compare the results.  If your algorithm is "wrong" 0.001% of the
time then it is very useful to many people.  If it is "wrong" 40% of the
time then it would only be useful in a limited set of circumstances.

> As far as I can see, it
> would require to
> be part of the core to deliver full power.

I still don't understand that claim.  Both ICU and FTS3 are extensions
and deliver their functionality that way.  Your algorithm could also be
delivered that way.  Once it has demonstrated utility then it would be
far easier to get into the core.

> I understand how difficult it is for some english-only developpers or 
> users, having to
> support code for non-english speakers / writers. I just thought it was 
> fading now.

And some developers have had to work on software that supports users all
over the globe throughout their career and know just how difficult this
stuff is!  And what happens when you get it wrong.

> It doesn't work with "locales" at all.

Errr, yes it does.  You are defining sort order and capitalization that
are to be presented to a user.  That user has a locale.

> It allows the user to declare its own set of characters
> and the way SQLite should handle them for low-level operations.

So the "feature" wouldn't ship with any of the sequences?  Each
developer using the feature would have to come up with their own?  You
won't allow more than one? In case you haven't looked at ICU, it
basically consists of code and data tables which deal with each l

Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-11 Thread Ian Walters

On 12/03/2009, at 12:36 PM, Jean-Christophe Deschamps wrote:
>
> To answer another post by Ian, yes I've had a look at ICU.  Of course
> ICU knows about its size, but what can they do about it, since their
> goal is to implement the most complete support possible?

Make parts of it be possible to factor out.  I've seen many 'complete'  
software packages that could have all the bits you don't need simply  
turned off and removed.

> And again,
> that's a very good thing. But my reluctance is elsewhere: I do not
> adhere that much in locales. I use more than a strictly defined locale
> "locale" and less than a world locale!

I'm not sure what you mean by this.  But I do know that 'locale' is  
important.  Even with the same set of characters two different locales  
will use two different orderings for the exact same set of  
characters.  And yes, I'm referring to Western Europe locales here.

>
> Also, yes I write most users. BTW SQLite would possibly have more
> "rest-of-the-world" fans if a more simple handling of diacritics were
> available!  Look: we in Western Europe have been computerized in ALL
> PURE ASCII CAPITALS for many years. Then in mixed casing, but always
> ASCII. Now is perhaps the time for something else.

Well, I may be a Ascii speaker, but the company I work for was founded  
in Norway and bought by a company in Finland with most of its software  
originally being used in Germany.  I use diacritics enough that I know  
how to type them by memory.  My own work as part of all this for a few  
years had most of its customers in China. Which is all to say: I  
always find it funny how some non-Americans talk about how centric the  
American point of view is and then seem to forget just how many people  
are not from Western Europe either.  Western Europe < (rest-of-world/2)

If fast localization is that important to you perhaps use Mimer  
instead?  Its collation is pretty good, it also (in my own tests) only  
a little slower and a little larger than SQLite.  There are some other  
restrictions as well, but it may be worth a look.

It would certainly be nice if extending collation in SQLite extended a  
little more into the deep code (in my own case I'd to change how NULLS  
are handled as well, but thats a different case).  But its also  
important to realize the value of SQLite, and some of that is exactly  
how minimal the core of it is.

--
Ian
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-11 Thread John Machin
On 12/03/2009 1:36 PM, Jean-Christophe Deschamps wrote:

> 
> BTW locales are far from perfection. For instance: you have to search 
> text, say an address book in a cellphone, with FTS3 and you know the 
> base may have words or names in a dozen european languages. How would 
> you do?  ICU? Huge and slow, but even then: which "locale" would you use?
> 
[snip]
> 
> Perhaps the best way is practice: what's the way to find this guy named 
> Éric or is it éric, or Eric, or eric? He lives in MÜNCHEN, München, 
> MUNCHEN, Munchen or ... Munich.

Ummm ... you do it in a case-agnostic language-agnostic fashion with 
fuzzy matching. You end up looking for "eric" in "munchen" or 
"muenchen". Munich is another story ... For cities sufficiently well 
known abroad that are spelled totally differently from the normal 
"romanisation" you need alias tables e.g. you can derive "moskva" by 
transliteration from the Cyrillic but "moscow" needs a look-up. IMHO 
none of this should be expected to be built-in to the DBMS.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-11 Thread Dennis Cote
Roger Binns wrote:
> Jean-Christophe Deschamps wrote:
>   
>> I'd like to have the group opinion about a feature I would find utterly 
>> useful in _standard_ SQLite.
>> 
>
> You are aware that "standard" SQLite is used in devices with a few
> kilobytes of memory through workstations and servers with gigabytes of it!
>
>   

Whether he is aware of that or not is largely irrelevant, though I 
suspect he is aware of this fact. He is asking for support for a 
mechanism that will provide most, if not all, the benefits of the ICU 
extension using much less memory and far fewer CPU cycles. This would 
benefit the users of small devices more than those using workstations, 
but it could be beneficial to all.

It would be almost universally beneficial if it could be omitted using a 
compiler define. Then even the small additional overhead of his proposal 
wouldn't impact those users who have no need for anything beyond ASCII.

> As far as I can tell you want some extra "standard" collation sequences
> and propose shortcuts that will get them mostly right.  And you want
> someone else to write the code!
>   

No, he is asking for standard support for "user defined" collating 
sequences. And, yes, he wants some else to write the code since it (like 
many other facilities) is not nearly as useful if it is not included in 
the standard SQLite released by Richard. He has even offered to pay to 
have it developed.

> SQLite makes it very easy to have extensions and to register them.  For
> example see http://sqlite.org/c3ref/auto_extension.html
>
> Generally the best approach would be to produce the code as an
> extension, document and test it well and then add to the contributions
> page at http://sqlite.org/contrib - once enough developers have used it
> and vouched for its utility then it would be far easier to lobby for
> incorporation into the "standard" SQLite.
>   

The problem with extensions is precisely that they are not universally 
available. If my application uses an extension to provide collation 
sequences, then I can not safely use any of the many GUI database 
browsers to manipulate the data since the GUI browser won't have access 
to the extension functions.
 
> For you to convince me of the utility of the code, you'd need to list
> which locales it gets right and which it gets wrong.  Software can seem
> pretty dumb to users almost getting some things right.
>
>   
His proposal doesn't rely on locales. It uses "user defined" strings to 
define a collating sequence. It will be right where you can define a 
collating sequence using a string, and a suitable string is defined.

He has already said that it won't be suitable for multiple byte 
characters or many other languages. For those cases where it is not 
suitable, a user could continue to use the ICU extension just as they 
can now.

Dennis Cote


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-11 Thread Jean-Christophe Deschamps

Roger,



>You are aware that "standard" SQLite is used in devices with a few
>kilobytes of memory through workstations and servers with gigabytes of it!

That's precisely why such approach is interesting!


>As far as I can tell you want some extra "standard" collation sequences
>and propose shortcuts that will get them mostly right.  And you want
>someone else to write the code!

Phew!  It wasn't not my intention to see anyone here go ballistic 
reading my post.

But look:
   1) I never pretended at any "standard", just a useful set of features
   2) what I said is that it can be made to fit _most_ needs with little
  requirements in memory and cycles
   3) I offered (twice) to pay a reasonable fee for such development if it
  implies some branching from mainstream core. The support page 
indeed reads:
If you would like professional support for SQLite or if you want custom 
modifications performed by the original author or SQLite, these 
services are available for a modest fee. For additional information 
visit 
http://www.hwaci.com/sw/sqlite/prosupport.html
 
or contact:
D. Richard Hipp
Hwaci - Applied Software Research
704.948.4565
d...@hwaci.com

I don't make this up you know.


>SQLite makes it very easy to have extensions and to register them.  For
>example see 
>http://sqlite.org/c3ref/auto_extension.html

True, but I think that no extension will ever overload consistanly the 
macros (!)
and other deep code (like inside FTS3) involved in REGEX, ORDER, 
GLOB/LIKE, UPPER...
Or maybe I overlook something obvious.
OTOH, having collation and comparison located in _code_ makes them 
difficult to port.


>Generally the best approach would be to produce the code as an
>extension, document and test it well and then add to the contributions
>page at http://sqlite.org/contrib - once 
>enough developers have used it
>and vouched for its utility then it would be far easier to lobby for
>incorporation into the "standard" SQLite.

I'm not lobbying, nor asking for good practice guidance. I was asking 
the group their
opinion about the usefulness of such feature. As far as I can see, it 
would require to
be part of the core to deliver full power.

I understand how difficult it is for some english-only developpers or 
users, having to
support code for non-english speakers / writers. I just thought it was 
fading now.


>For you to convince me of the utility of the code, you'd need to list
>which locales it gets right and which it gets wrong.  Software can seem
>pretty dumb to users almost getting some things right.

Please make me the favor to _read_ my post. I believe you have enough 
experience to
understand it and think about the _practical_ usefulness of such 
behavior. It doesn't
work with "locales" at all. It allows the user to declare its own set 
of characters
and the way SQLite should handle them for low-level operations. It's a 
DIY thing!

BTW locales are far from perfection. For instance: you have to search 
text, say an address book in a cellphone, with FTS3 and you know the 
base may have words or names in a dozen european languages. How would 
you do?  ICU? Huge and slow, but even then: which "locale" would you use?

I'm in no way saying that ICU is a bad thing. Of course it's been 
extremely carefully coded and reviewed. It's a very nice building block 
... but only for those that need the most comprehensive solution 
available (and can afford to bring it aboard).

Perhaps the best way is practice: what's the way to find this guy named 
Éric or is it éric, or Eric, or eric? He lives in MÜNCHEN, München, 
MUNCHEN, Munchen or ... Munich.


To answer another post by Ian, yes I've had a look at ICU.  Of course 
ICU knows about its size, but what can they do about it, since their 
goal is to implement the most complete support possible? And again, 
that's a very good thing. But my reluctance is elsewhere: I do not 
adhere that much in locales. I use more than a strictly defined locale 
"locale" and less than a world locale!

Also, yes I write most users. BTW SQLite would possibly have more 
"rest-of-the-world" fans if a more simple handling of diacritics were 
available!  Look: we in Western Europe have been computerized in ALL 
PURE ASCII CAPITALS for many years. Then in mixed casing, but always 
ASCII. Now is perhaps the time for something else.



Jean-Christophe 

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-11 Thread Ian Walters

On 12/03/2009, at 8:15 AM, Jean-Christophe Deschamps wrote:

>
> I feel the need for a different implementation of collating support.
> Not every user of SQLite needs full universal collating support à la
> ICU. It's huge and slows things down significantly.

This may be a dumb question, but why not suggest ways to speed up and  
shrink ICU?  I quickly found a page showing ways to shrink the foot  
print, they are obviously aware and concerned about size.

Also be careful about statements about 'most users'.

http://www.aneki.com/languages.html

--
Ian
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-11 Thread Roger Binns
Jean-Christophe Deschamps wrote:
> I'd like to have the group opinion about a feature I would find utterly 
> useful in _standard_ SQLite.

You are aware that "standard" SQLite is used in devices with a few
kilobytes of memory through workstations and servers with gigabytes of it!

As far as I can tell you want some extra "standard" collation sequences
and propose shortcuts that will get them mostly right.  And you want
someone else to write the code!

SQLite makes it very easy to have extensions and to register them.  For
example see http://sqlite.org/c3ref/auto_extension.html

Generally the best approach would be to produce the code as an
extension, document and test it well and then add to the contributions
page at http://sqlite.org/contrib - once enough developers have used it
and vouched for its utility then it would be far easier to lobby for
incorporation into the "standard" SQLite.

For you to convince me of the utility of the code, you'd need to list
which locales it gets right and which it gets wrong.  Software can seem
pretty dumb to users almost getting some things right.

Roger
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Proposal for SQLite and non pure ASCII letters

2009-03-11 Thread Jean-Christophe Deschamps
Hello group,


I'd like to have the group opinion about a feature I would find utterly 
useful in _standard_ SQLite.

Here's a rewrite of mails sent to hwaci about it, without success so far.

Note: I guess that non pure ASCII characters in the sample strings 
below will translate to '?', but you can obviously get the idea.

---
First I'd like to congratulate contributors to SQLite. You know who and 
why!

Now I'd like to either submit a request or ask for a quotation wrt (yet 
another) new feature in SQLite.

SQLite has been supporting Unicode for a long time but Unicode brings 
its own problems. Collating is one of the most important. I know one 
can add a user-defined collating function but I think it doesn't solve 
all situations.

I feel the need for a different implementation of collating support. 
Not every user of SQLite needs full universal collating support à la 
ICU. It's huge and slows things down significantly.

Most users only have a very small number of close languages to deal 
with, e.g. European Western ones.  For 98% of those users only the 
diacritics are a problem, but these accented letters are only a tiny 
subset of Unicode for a given language.

Hence I believe a large user base could benefit of having a new 
"USERSTRING" collating support which could be implemented in only two 
couples of strings of equal length _stored_ in a reserved table of a 
database and a few lines of fast C code in the SQLite library.

The first string, call it OrderString, would define the specific 
character set required by this precise application and its sort order 
(ORDER BY). Any char not found there should be handled as per its 
Unicode value.  This would allow users to explicit the charset and 
sorting order they need, without having to bring giant tables and slow 
code in the picture.

The second string, call it ClassString, would hold the character class 
for each character in the first string.  Its use is for LIKE operator 
and FTS3 support, after lookup in Orderstring to get letter index in 
this string. You can choose to have both upper- and lower-case letters 
there (case sensitive LIKE) or only lowercase (case insensitive LIKE).

The third string, call it UpperString, would hold the uppercase version 
of the charset.

The fourth string, call it LowerString, would hold the lowercase 
version of the charset. Should be identical to UpperString for 
languages that don't have casing.

Let me elaborate how the OrderString can cope with letters in two or 
more intervals. To compare strings, we need only know how to compare 
single characters.
Let C and D be the characters to compare and let A be the first and Z 
the last character in OrderString. Also let Idx(char) be the position 
of char in Orderstring. We can proceed as follows:

ic = Index(C, OrderString)   // index of C or -1 if not found
id = Index(D, OrderString)   // index of D or -1 if not found
If C != -1
   If D != -1
 Return Sign(ic - id)
   Else
 Return Sign(D - A)
   Endif
Else
   If D != -1
 Return Sign(C - A)
   Else
 Return Sign(C - D)
   Endif
Endif

In plain language, if C & D are both inside or outside the user 
charset, then compare them directly within their set (user or 
rest-of-Unicode). Else compare the char not in userset against the 
first char in userset. This trick makes the userset a separate "branch" 
off Unicode, sorted apart from the Unicode "sequence".

The pseudo-code above should be compiled very efficiently by any decent 
compiler. OrderString not only defines the userset and its sort order, 
but it also defines the "letter" class for REGEXes, upper and lower 
classes being also well defined by UpperString and LowerString.

The LIKE operator goes along quite the same line. With notation above:

ic = Index(C, OrderString)   // index of C or -1 if not found
id = Index(D, OrderString)   // index of D or -1 if not found
If C != -1 AND If D != -1
 Return (ClassString[ic] == ClassString[id])
ElseIf C != -1 AND If D != -1
 Return (C == D)
Else
 Return False
Endif

If we would like to have LIKE act depending on a CASE_SENSITIVE flag 
[sounds it could be the compareInfo.noCase flag], then it would require 
ClassString be in fact TWO strings, UpperClassString and 
LowerClassString. Anyway, once the choice made, it's as easy to 
implement with straightforward code.

As an example, I'm interested in finding customers first or last names, 
street addresses or city names mostly for French customers. I may also 
have some Spanish and Portugese names around, as well as names from 
Netherlands, Germany, Switzerland, Italy. I would use FTS3 virtual 
tables knowing that FTS3 will use my user supplied strings to record 
words in pure ASCII, but I could use any kind of search with data 
containing diacritics: it would work like a charm transparently.

I would start with something like (mailing this could partly destroy it!):

OrderString =
  
'AÀÁÂÃÄÅÆaàáâãäåæBbCÇcçDdEÈÉÊËeèéê