I've realized that this proposal does not easily allow for tables without per-row tokenizers. I think that modifying the create syntax to be like:
  CREATE VIRTUAL TABLE t USING fts3(TOKENIZER DEFAULT <tokenizer>, ...);

could handle that. If you use "TOKENIZER <tokenizer>", then you get the
current behaviour, with a fixed tokenizer for the entire table. For
consistency, the implicit column could still be defined, but be fixed to
the value <tokenizer>. Inserts or updates which set that column to
<tokenizer> or NULL would continue to work, but any other value would
throw an error. With the addition of "DEFAULT" (or some other suggestive
word, TBD), you're indicating that you want the ability to customize the
tokenizer per row.

-scott

On 9/17/07, Scott Hess <[EMAIL PROTECTED]> wrote:
> As part of doing internationalization work on Gears, it has been
> determined that it is unlikely that you can just define a global
> tokenizer that will work for everything. Instead, in some cases you
> may need to use a specific tokenizer, based on the content being
> tokenized, or the source of the content. This can be emulated by
> using multiple tables and complicating your joins, but it would be
> nicer if fts could just accommodate this use case.
>
> In the interests of not committing something that people won't like,
> my current proposal would be to add an implicit TOKENIZER column,
> which will override the table's default tokenizer for that row. So,
> you could do something like:
>
>   CREATE VIRTUAL TABLE t USING fts3(TOKENIZER icu(en), content);
>   INSERT INTO t VALUES ('testing testing'); -- Uses icu(en).
>   INSERT INTO t (tokenizer, content) VALUES ('icu(kr)', '나의'); -- Uses icu(kr).
>   SELECT rowid FROM t WHERE t MATCH 'TOKENIZER:icu(kr) 의';
>
> [Forgive me if you can read Korean and I just did something offensive.
> I'm doing copy/paste, here!]
>
> fts allows for anything starting with 'tokenize' in that location in
> the CREATE statement, so in the above all uses must match. If you
> used "TOKENIZE" in the create, you use "TOKENIZE" everywhere else.
> In MATCH, it must be the uppercase term used in the create (the other
> places are case-insensitive), followed by : followed by a valid
> tokenizer name followed by an optional parameter list.
>
> A variant which I think is somewhat interesting would be:
>
>   CREATE VIRTUAL TABLE t USING fts3(tokenizer TOKENIZER DEFAULT icu(en), content);
>
> This makes the "tokenizer" column a bit more explicit, and the
> 'DEFAULT ...' syntax makes it clear what's going on, but I couldn't
> really think of any other sensible name for the column, so it also
> feels redundant. Since 'tokenize' is already a reserved prefix for
> fts, I'm inclined towards the first variant.
>
> Opinions?
>
> -scott
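
To make the semantics of the two forms concrete, here is a rough sketch of how the tokenizer for a single row might be chosen. This is plain Python, not fts3 code, and nothing like it exists in fts3 today; `resolve_tokenizer`, `TokenizerError`, and the `per_row` flag are hypothetical names used only for illustration. It assumes that "TOKENIZER <tokenizer>" fixes the tokenizer for the whole table, while "TOKENIZER DEFAULT <tokenizer>" lets the implicit column override it per row.

```python
from typing import Optional


class TokenizerError(ValueError):
    """Hypothetical error for a row that names a tokenizer the table rejects."""


def resolve_tokenizer(table_tokenizer: str, per_row: bool,
                      row_value: Optional[str]) -> str:
    """Pick the tokenizer for one row under the proposed rules.

    table_tokenizer: the <tokenizer> named in the CREATE statement.
    per_row: True for "TOKENIZER DEFAULT <tokenizer>",
             False for plain "TOKENIZER <tokenizer>".
    row_value: value written to the implicit tokenizer column,
               or None if it was NULL or omitted.
    """
    # NULL (or an omitted column) always falls back to the table's tokenizer.
    if row_value is None:
        return table_tokenizer
    # With DEFAULT, any value is accepted as a per-row override. Without it,
    # only a value equal to the table's own tokenizer is accepted.
    if per_row or row_value == table_tokenizer:
        return row_value
    raise TokenizerError(
        f"table is fixed to {table_tokenizer!r}; cannot use {row_value!r}"
    )
```

For example, `resolve_tokenizer("icu(en)", True, "icu(kr)")` yields the per-row override, while the same call with `per_row=False` raises the error. Whether a real implementation would also validate the tokenizer name at this point is left open here.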