The fts module doesn't do anything "interesting" with embedded meta-data in the interests of simplicity. Stripping the info out before inserting is probably easiest, but has the downsides of duplication (assuming you need to keep the raw data elsewhere), and it means that queries involving snippets and the like may be funky. Probably the best way to go about this would be to convert all tags to a single whitespace character.
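As a rough illustration of the "convert all tags to a single whitespace character" idea, here is a minimal Python sketch. The names `TAG_RE` and `tags_to_space` are made up for this example, and the regular expression is deliberately naive: it assumes no '>' characters appear inside attribute values, comments, or script blocks, so treat it as a starting point rather than a real HTML parser.

```python
import re

# Naive tag pattern (an assumption for this sketch): anything from '<' to the
# next '>'. It will misfire on '>' inside attribute values, comments, scripts.
TAG_RE = re.compile(r"<[^>]*>")


def tags_to_space(html: str) -> str:
    """Replace every tag with one space so that tags act as word breaks."""
    return TAG_RE.sub(" ", html)


# Tags between words become harmless separators.
print(tags_to_space("<p>Hello <b>world</b></p>").split())

# Tags inside a word split it apart, which is the weakness of this approach.
print(tags_to_space("t<b>h</b>is"))
```

Note that the second call turns one word into three fragments ("t h is"), and that the cleaned text no longer lines up character-for-character with the raw HTML stored elsewhere.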

Building a custom tokenizer is certainly doable, but could be a frustrating goal if you intend to process generic HTML you find on the web, just because of the number of heuristics you'll have to layer in. As a first pass, you might just treat tags as word breaks as you iterate over the input. But there definitely are cases where HTML markup happens within words, so you might need something a bit more sophisticated.

There is some level of support for returning tokens which are not literally present in the input. For instance, for the input 't<b>h</b>is' you could return 'this' and indicate that it corresponds to 11 characters in the input, and everything should work. I'm not sure anyone has ever exercised this aspect of things strongly, though, so it's possible that things don't work as intended when you do that.

Before going either direction, you should probably sit down and figure out what exactly you're going to do with the results you get from the table. If you want to, say, present them on a web page, then your problems are just beginning, because the tag nesting will open up layout issues and security problems. It may be that thinking through that part of the system will help you figure out an appropriate approach for this part of the system (for instance, if you decide to strip tags for other reasons, then it all becomes easy!).

-scott

On Tue, Mar 10, 2009 at 6:56 AM, Paul Perry <[email protected]> wrote:
> Thank you for the pointers Alexandre and Alexey.
>
> I spent about 30 minutes looking into the parser, and it looks like it is a
> possibility. I'll require a more in-depth understanding in order to do
> this. I would probably start with the simple parser, and go from there.
>
>> I think to prepare html before insert is more simple. You can transform
>> html into "right" format for fts3 parser.
>
> I would actually like to retain the tagged (html) formatting in the
> database, thereby, when it is retrieved it can be displayed as rich text.
>
> Thanks,
> Paul
>
> On Tue, Mar 10, 2009 at 4:32 AM, Alexey Pechnikov
> <[email protected]> wrote:
>
>> Hello!
>>
>> On Tuesday 10 March 2009 06:16:16 Alexandre Courbot wrote:
>> > Never did this myself, but I think you can do what you need by writing
>> > your own tokenizer:
>> >
>> > http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
>>
>> It's not good advice for a few documented module.
>>
>> I think to prepare html before insert is more simple. You can transform
>> html into "right" format for fts3 parser.
>>
>> Best regards.

_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
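
To make the 't<b>h</b>is' example from the reply concrete, here is a small Python sketch of the token-with-offsets idea. It only illustrates the logic: a real FTS3 tokenizer is written in C against the tokenizer interface described in README.tokenizers, and the function name `tokenize` and the tuple layout below are assumptions made for this example. The sketch skips every tag without ending the current word, which is itself a heuristic; block-level tags such as <p> would probably need to act as word breaks.

```python
def tokenize(text):
    """Return (token, start, end) tuples, where start/end index the raw input.

    Tags are skipped without ending the current token, so a token's text can
    be shorter than the input span it covers.
    """
    tokens = []
    chars, start, end = [], None, None
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c == "<":
            close = text.find(">", i)
            if close != -1:
                i = close + 1  # jump past the tag; the current token stays open
                continue
        if c.isalnum():
            if start is None:
                start = i  # first character of a new token
            chars.append(c.lower())
            end = i + 1  # span ends just past the last real character
        elif chars:
            # Any other character (space, punctuation, stray '<') ends the token.
            tokens.append(("".join(chars), start, end))
            chars, start, end = [], None, None
        i += 1
    if chars:
        tokens.append(("".join(chars), start, end))
    return tokens


print(tokenize("t<b>h</b>is"))       # one token, 'this', spanning all 11 input characters
print(tokenize("a <i>big</i> dog"))  # surrounding tags fall outside the 'big' span
```

With this approach the raw HTML can stay in the table while the index sees clean words, though as the reply warns, it is unclear how well snippets and offsets cope with tokens whose text differs from the input span.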

