Jean-Christophe Deschamps wrote: >> You are aware that "standard" SQLite is used in devices with a few >> kilobytes of memory through workstations and servers with gigabytes of it! > > That's precisely why such approach is interesting!
I apologise for not making myself clear. Everything that is part of standard SQLite affects all users of it and imposes testing and compatibility constraints. Even people not using it will have to use OMIT directives and so all the testing will have to be done with that on and off. >> As far as I can tell you want some extra "standard" collation sequences >> and propose shortcuts that will get them mostly right. And you want >> someone else to write the code! > > Phew! It wasn't not my intention to see anyone here go ballistic > reading my post. I wasn't going ballistic - just trying to summarise your request in one paragraph :-) > 1) I never pretended at any "standard", just a useful set of features I quote: I'd like to have the group opinion about a feature I would find utterly useful in _standard_ SQLite. > 2) what I said is that it can be made to fit _most_ needs with little > requirements in memory and cycles More accurately it would fit with your needs. What you are talking about is collation which means ordering of different strings with respect to each other. For SQLite that also means being case sensitive or insensitive. Equality also falls out of this. > 3) I offered (twice) to pay a reasonable fee for such development That isn't a problem. It just places you a little further away from the issues that arise. > ... if it implies some branching from mainstream core. You won't need to branch - just register as an extension. It also doesn't need the SQLite team to implement your proposal - anyone can. > True, but I think that no extension will ever overload consistanly the > macros (!) > and other deep code (like inside FTS3) involved in REGEX, ORDER, > GLOB/LIKE, UPPER... > Or maybe I overlook something obvious. You are aware that FTS3 is implemented exactly as an extension? It doesn't modify the core of SQLite, just registers as an extension like anyone else's code could. Anything FTS3 could do, you can do. > OTOH, having collation and comparison located in _code_ makes them > difficult to port. They have to be in code at some point. The default builtin code is very simplistic, only paying attention to ASCII and just comparing raw bytes for non-ascii characters. ICU is also "located in code", but uses many tables to drive what happens. As far as I can tell from your proposal you would also have code to do the comparisons and store data to help drive that in the database somewhere. > I'm not lobbying, That is exactly what your first sentence was :-) > nor asking for good practice guidance. The purpose of the "guidance" was to show how to get your proposal to be a standard part of SQLite. > I was asking the group their opinion about the usefulness of such feature. As described I think it is mostly useless. Sorting even just for West European languages is not trivial. For example some countries have a different sort order for using names vs in the dictionary. (This complexity is part of the reason why ICU is so large.) As described it would only get things mostly right. That is why I suggested documenting where it gets things wrong - for example grab some random Italian, Swedish, Flemish etc text and sort using your algorithm and using ICU and compare the results. If your algorithm is "wrong" 0.001% of the time then it is very useful to many people. If it is "wrong" 40% of the time then it would only be useful in a limited set of circumstances. > As far as I can see, it > would require to > be part of the core to deliver full power. I still don't understand that claim. Both ICU and FTS3 are extensions and deliver their functionality that way. Your algorithm could also be delivered that way. Once it has demonstrated utility then it would be far easier to get into the core. > I understand how difficult it is for some english-only developpers or > users, having to > support code for non-english speakers / writers. I just thought it was > fading now. And some developers have had to work on software that supports users all over the globe throughout their career and know just how difficult this stuff is! And what happens when you get it wrong. > It doesn't work with "locales" at all. Errr, yes it does. You are defining sort order and capitalization that are to be presented to a user. That user has a locale. > It allows the user to declare its own set of characters > and the way SQLite should handle them for low-level operations. So the "feature" wouldn't ship with any of the sequences? Each developer using the feature would have to come up with their own? You won't allow more than one? In case you haven't looked at ICU, it basically consists of code and data tables which deal with each locale. > BTW locales are far from perfection. For instance: you have to search > text, say an address book in a cellphone, with FTS3 and you know the > base may have words or names in a dozen european languages. How would > you do? ICU? Huge and slow, but even then: which "locale" would you use? When you display stuff to a user, you are always using a locale. As a developer you need to be aware and do whatever the right thing is. For your example I would have the locale set by the user of the cellphone. If they say French then I'd use the French collation rules. If they said German then I'd use the German rules (which differ for phone books vs other uses). The display would end up in a different order. Note that ICU would get that correct today. It is only huge and slow because of all the locales it supports and the idiosyncrasies of each locale. But it does get the answer right. Other approximations such as your proposal would only get the answers mostly right. > Perhaps the best way is practice: what's the way to find this guy named > Éric or is it éric, or Eric, or eric? He lives in MÜNCHEN, München, > MUNCHEN, Munchen or ... Munich. We all agree that it is difficult. Basically there are 3 solutions: 1 - Do ASCII only, other locales are deliberately/obviously wrong 2 - A small light algorithm that gets some locales somewhat right 3 - Get all locales 100% correct SQLite ships with #1 and #3. For #2 it is important just how right (or wrong) it gets various locales. The Unicode Collation Algorithm - http://unicode.org/reports/tr10/ - also tries #2 although you'd probably consider it to much code/data. The begining of that page also describes many locale dependent issues just like your example, and why the locale matters. ICU started out as the UCA and now has lots more code and data. > I use more than a strictly defined locale > "locale" and less than a world locale! But that only solves your specific situation! The more specific the less value it has to anyone else. > Now is perhaps the time for something else. We already have two choices - get only ASCII 100% right and get all locales 100% right (ICU). Proposing something inbetween should at least give an idea of accuracy. For example if you said 99% right for 90% of Western European locales with only 100 lines of code and 1000 bytes of data then it is a good proposal. Roger _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users