Alexey Pechnikov wrote:
> Hello!
> 
> In a message dated Saturday 26 July 2008 21:37:19, Stephen Woodbridge wrote:
>> I have thought a lot about these issues and would appreciate any
>> thoughts or ideas on how to implement any of these concepts or others
>> for fuzzy searching and matching.
> 
> I know that ispell, myspell, hunspell and trigrams are used in PostgreSQL 
> FTS, and a lot of languages are supported this way. The soundex function is 
> also useful for morphology searches if the word is written in the Latin 
> alphabet (transliteration: replace each symbol of the national alphabet 
> with one or more Latin letters):
> 
> sqlite> select soundex('Moskva');
> M210
> sqlite> select soundex('Moscva');
> M210
> sqlite> select soundex('Mouscva');
> M210
> sqlite> select soundex('Mouskva');
> M210
> sqlite> select soundex('moskva');
> M210
> 
> Note: compile SQLite with -DSQLITE_SOUNDEX=1
> 
> There is stemming in Apache Lucene, Sphinx (which includes soundex-based 
> morphology) and Xapian too.
> 
> Are these features planned for SQLite FTS?

Well, I will leave the question of plans to Scott Hess, the FTS developer, 
to answer.

I just read a number of the FTS overview documents for PostgreSQL, which 
I use a lot for other projects, and I like the way they have things 
broken down and integrated with the database. I haven't tried 8.3 yet, 
but it is nice to see that FTS is now part of the main distribution.

http://www.sai.msu.su/~megera/postgres/fts/doc/fts-history.html
http://www.sai.msu.su/~megera/postgres/fts/doc/fts-basic.html
http://www.sai.msu.su/~megera/postgres/fts/doc/fts-dict.html

I think you can add dictionaries as stemmers the same way you would add 
a stemmer to SQLite. Look at the code in the SQLite source tree:

ext/fts3/fts3_porter.c
ext/fts3/fts3_tokenizer.[ch]
ext/fts3/fts3_tokenizer1.c

As for other lexemes, there is nothing stopping you from creating 
your FTS table with additional lexeme columns that you populate with 
the appropriate lexemes derived from the full-text column. Of course, you 
have to generate the lexemes yourself and supply them as the text for that 
column.

For example, if you wanted a soundex column, you could run your document 
through the simple or porter tokenizer, generate the soundex key for each 
token, concatenate the keys with separating spaces, and use that string as 
the contents of the soundex lexeme column. To query, you would tokenize the 
incoming words, generate their soundex keys, and do an FTS search on that 
column (a rough sketch follows below). It would obviously be nicer if this 
were built into the existing FTS engine, but you could do it today with 
some additional programming.
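
Here is a minimal sketch of that idea in Python, assuming the sqlite3 module 
was built with FTS3/FTS4 enabled; the soundex() helper, the "docs" table and 
its "body"/"body_soundex" columns are purely illustrative, not anything 
provided by SQLite itself:

import re
import sqlite3

# Tiny soundex implementation, for illustration only.  (SQLite's own
# soundex() SQL function is available when compiled with -DSQLITE_SOUNDEX=1.)
def soundex(word):
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.upper()
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":      # H and W do not separate duplicate codes
            prev = code
    return (result + "000")[:4]

def soundex_text(text):
    # Tokenize roughly like the "simple" tokenizer would, then replace
    # each token with its soundex key.
    return " ".join(soundex(tok) for tok in re.findall(r"[A-Za-z]+", text))

db = sqlite3.connect(":memory:")

# Hypothetical schema: the original text plus a precomputed soundex column.
db.execute("CREATE VIRTUAL TABLE docs USING fts4(body, body_soundex)")

body = "Moskva is the largest city in Russia"
db.execute("INSERT INTO docs(body, body_soundex) VALUES (?, ?)",
           (body, soundex_text(body)))

# At query time the incoming word gets the same treatment: soundex it,
# then MATCH against the soundex column.
for (hit,) in db.execute("SELECT body FROM docs WHERE body_soundex MATCH ?",
                         (soundex_text("Mouscva"),)):   # -> 'M210'
    print(hit)

The key point is that the same preprocessing is applied to both the stored 
documents and the incoming query terms, otherwise the keys will never line up.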

As I said before, I will leave questions of FTS planning up to 
Scott. I have read through his fts3 code, and I confess I do not 
understand how it all works, but for a relatively small amount of code it 
works impressively well.

All the best,
   -Steve