Hi hi Jürg, Mikael & friends, I'm writing here the different collation implementations we've been considering, trying to write down the pros and cons of each solution; and hopefully we can reach an agreement on what is the best one to get implemented in Tracker.
The basic list of requirements related to collation would be: a) Order of results of queries based on collation, instead of binary comparison of strings. b) Collation is based on locale of the user, so changes of locale must be supported. c) Tracker should support returning collation keys of the result set of a query, like enabling a new tracker:collate() method in SPARQL. d) Something else? **************** 1) Using a new COLLATE keyword in SPARQL queries. This solution is probably the slowest one and probably not wanted by anyone, but anyway, I'm also considering it here. The idea is to have a new "COLLATE" keyword in SPARQL, so that if you say "ORDER BY (?column) COLLATE" it will force using collation in the order of the results. Pros: +++ Collation only used when needed. Not sure if that's a pro anyway; as soon as you are doing ORDER BY you are expecting a proper order, and the one provided by the binary comparison is useless even in English (for example in small letter vs capital letter). +++ Doesn't need neither changes in the db tables, nor any special treatment at database level. +++ No performance degradation when inserting new elements. +++ Doesn't need any special change to support locale changes, as every query is independent, so for each one the locale in that moment will be used. Cons: --- Slowest method probably, as it means that collation function is used for comparison for every query which is done to the store. --- If collation keys requested, they must be computed for each query, as they are nowhere stored. --- Order of results may be different if using dbus or direct-access backends. If using dbus the locale used for collation is the one of the tracker-store; while if using direct-access the locale used would be the one of the process making the query. **************** 2) Set the collation method by default in the text columns when creating the db tables. We actually already agreed on doing this, before realizing that we also need to be able to return collation keys (requirement C, above). Pros: +++ Fast in columns with indexes, as the collation would be used to create the indexes properly, and during the query the order is already implicit. +++ Integration with sqlite is direct, just enable the collation function when creating the column, and it will use it directly wherever needed. Easier to maintain, probably. +++ Almost no performance degradation when inserting new elements (see previous emails on this topic). Cons: --- If collation keys requested, they must be computed for each query, as they are nowhere stored. --- If a query is done using a column without index for ordering, collation function will be used to order the results, so it will be as slow as method 1). --- On locale change, indexes must be re-created, which may be slow if lots of elements in the tables. Although this is not a big deal at the end. **************** 3) New column to store the collation keys. This is the method used in Fremantle. The idea is that for every text column, we have an additional column where we store the collation key of the text. Then, when issuing an ORDER BY we would be using that extra column as input for the ordering, so that a simple binary comparison can be used. I guess we would need an index in that new column with collation keys, only for those text columns that have an index. Pros: +++ Fast in columns with indexes (as fast as method 2). +++ Fast in columns without indexes, as binary comparison can be used on the collation keys column. +++ Retrieving the collation keys in a query is direct, no need to compute them on the fly for the query results, so faster than method 2) for this purpose. Cons: --- Inserting new elements will be slowest of the 3 options (don't have numbers for this); as collation key needs to be computed. If we do this approach for all text columns, I'm wondering how long it will take to compute the collation key for plainTextContent... We could always say that we'll only add the new column on properties configured to be indexed, but then, what if we want an ORDER BY on a column not indexed? No collation would be applied in that case. --- Changes at db level are quite more than case 2). Not only the new column needs to be populated properly, but also the sparql to sql translator needs to be patched so that if ORDER BY(text-column) we really do ORDER BY (collation-key-column) in sql. Right? --- When locale changes, need to re-compute all collation keys and then re-create all indexes. This is quite slower than method 2, which only needs index re-creation). **************** If requesting the collation keys is going to be done usually, 3rd method seems the most appropriate. If this is not the case, I would go for the 2nd method as it will be the cleanest one and the easiest to maintain, as sqlite makes all the work for us (but maybe I'm biased because 2nd method is almost implemented :-) ) Comments? -- Aleksander _______________________________________________ tracker-list mailing list [email protected] http://mail.gnome.org/mailman/listinfo/tracker-list
