[Input == great!]

----
Regarding space usage, my current prototype stores an additional column on %_content, defined like 'tokenizer TEXT DEFAULT NULL'. If NULL, the default tokenizer is used and the per-row cost is negligible. Otherwise, it is the string naming the tokenizer, which I would expect to generally be 5-8 bytes, which I think is pretty reasonable. I think that if you do not need the ability to customize the tokenizer on a per-row basis, there should be no storage cost compared to the current implementation. I'm not quite confident in making that an explicit goal, but I generally do think fts should minimize the cost of a feature for anyone who's not using that feature, so long as it doesn't cause excessive complexity (I don't think being low-cost in this case will).

There are a couple of options which could be used if this is deemed excessive. An internal mapping table could be used to allow the tokenizers to be represented as varints, which would effectively make the per-row cost fixed, at the cost of a page for the mapping table. An alternative would be to have the mapping table store the actual rowid->tokenizer mapping, using various compression tricks (like the fts index uses) to keep things compact. In either case, there's a somewhat higher fixed cost involved.

----

Regarding per-row versus per-column tokenizers, your suggestion to have something like 'content_kr TOKENIZER icu_kr' for each variant is reasonable, but I'm not certain what gain it would have over simply having separate tables for each variant. Since data about records is pushed into the schema itself, you potentially have to generate fairly intricate queries, probably dynamically. This approach also suffers from the query-tokenization problem you mention, except insofar as you MATCH only against specific columns.

To be clear about the type of problem my proposal was targeted at, say you have something like Google Reader, where there are a bunch of articles that you want to search over. Each article generally doesn't have multiple translated variants; instead, the language is a piece of data about the article. For this kind of system, pre-defining columns for every language the system may encounter might result in dozens of columns, with exactly one column used for any particular row.

NOTE: I think that your idea of per-column default tokenizers may be a good one for fts to have. I'm just questioning whether it is targeted at the same problem my proposal is. I'm going to think about whether the proposals can be sensibly merged (they're not quite orthogonal, because using both on the same table effectively requires per-field tokenizer specification).

----

Regarding query tokenization, yes, the query must be parsed using a tokenizer appropriate to the query. That is where the suggested change to the query syntax came from:

  SELECT rowid FROM t WHERE t MATCH 'tokenizer:<x> ...';

Tokenizing the query using every possible tokenizer may result in more results, but is unlikely to result in higher-quality results. fts could do some sort of index-directed tokenization (if you tokenize into a bunch of terms which aren't in the index, you won't hit anything!), but that seems like a project for some future quarter. I think doing anything automagic with query tokenization is a challenge, because queries tend much more towards sentence fragments.
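To pull the pieces above together, here's a rough sketch of how I'd expect usage to look. All of the syntax is illustrative rather than final, and the table and tokenizer names are made up:

  -- Table with a per-table default tokenizer, as today.
  CREATE VIRTUAL TABLE articles USING fts2(content, tokenize simple);

  -- Rows using the default pay no per-row storage cost; the
  -- implicit tokenizer column stays NULL.
  INSERT INTO articles (content) VALUES ('an article in English');

  -- A row overriding the table's default tokenizer.
  INSERT INTO articles (content, tokenizer)
    VALUES ('a Korean-language article', 'icu_kr');

  -- The query string names the tokenizer used to parse it.
  SELECT rowid FROM articles WHERE articles MATCH 'tokenizer:icu_kr seoul';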
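For comparison, the two shapes discussed in the per-row-versus-per-column section would look something like the following (again, hypothetical syntax and names):

  -- Per-column tokenizers, per your suggestion:
  CREATE VIRTUAL TABLE docs USING fts2(
    content_en TOKENIZER simple,
    content_kr TOKENIZER icu_kr
  );

  -- versus one table per language variant:
  CREATE VIRTUAL TABLE docs_en USING fts2(content);  -- default tokenizer
  CREATE VIRTUAL TABLE docs_kr USING fts2(content, tokenize icu_kr);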
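And a sketch of the mapping-table alternatives from the storage discussion, with all table and column names hypothetical:

  -- Interns tokenizer names so each row only stores a small varint id.
  CREATE TABLE articles_tokenizer_map (
    id INTEGER PRIMARY KEY,    -- referenced per-row as a varint
    name TEXT UNIQUE           -- e.g. 'icu_kr'
  );

  -- The second variant keeps %_content untouched and stores the
  -- rowid->tokenizer mapping in its own (compressible) table.
  CREATE TABLE articles_rowid_tokenizer (
    docid INTEGER PRIMARY KEY,  -- rowid in %_content
    tokenizer_id INTEGER        -- id in the map above
  );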
-scott

On 9/18/07, Ralf Junker <[EMAIL PROTECTED]> wrote:
> Hello Scott Hess,
>
> > In the interests of not committing something that people won't like, my
> > current proposal would be to add an implicit TOKENIZER column, which will
> > override the table's default tokenizer for that row.
>
> There are a few things I am worried about with this approach:
>
> 1. FTS storage size
>
> Will the TOKENIZER column not add to the overall size of the FTS storage,
> even if the default tokenizer is used? As FTS requires storing all text,
> its storage requirements are quite high already and did put people off
> SQLite as their full text search implementation.
>
> 2. Potential incompatibility with the query parser tokenizer
>
> The table's text tokenizer is used to tokenize the query string as well.
> AFAIK, both must be identical. I can not see how this single query
> tokenizer can then cooperate with a potentially unlimited number of
> incompatible row tokenizers. Reparsing the query for each row is, I guess,
> out of the question for performance reasons.
>
> * Alternative suggestion
>
> Offer a per COLUMN tokenizer option instead of a per ROW one. This would
> get rid of problem 1 because the tokenizer can be stored with the column
> definition.
>
> The COLUMN tokenizer option would also help with problem 2: The engine can
> then parse the query according to each column's tokenizer setting. Not all
> queries might make sense with all columns, but at least the engine would
> guarantee that both are using the identical tokenizer. It would be up to
> the application to search certain columns for a particular language query
> only.
>
> I also find the per-column tokenizer override easier to grasp (for
> translations, for example), because one can define different language
> columns with different tokenizers: Content_EN, Content_KR, and so on. This
> of course assumes that the number of supported languages is limited. New
> languages can be added with ALTER TABLE, but an application with support
> for an infinite number of languages would probably opt for the
> one-table-per-language option.
>
> Ralf