Scott,

Thanks so much for your thoughtful response. Things can get complicated (especially with full text searching) when using formatted text.

What I really want to do is retain the formatting so the text can be displayed in an HTML file. The text out of the database would be wrapped by a basic HTML outer document, and the formatted text would mainly reference style sheets. This is for a desktop application, so I am not trying to target a browser.

I was thinking it is OK to limit myself to full open and close tags, and also to not have a single word be broken up by markup, as you illustrated nicely. So I was thinking I could just filter out (or remove) the basic tags I am using, such as:

    Some text. <p id="style1">this is more text.</p> Followed by more.

By simply not putting the tags into the index (the content technically wouldn't be fully formed XML with a header and such), it would still retain the text for searching but ignore the tags.

I agree that trying to handle generic HTML would be a bear. The content is not coming from the web, so I have control over that as well. I guess there could be problems with snippets or with the offsets of the text.

I'll have to take a closer look into the provided parsers, and maybe there is a way that any tags could just *not* be put into the index. I read the documents pointed out and the readme. It looks like most of the information on it is in the source code, so I will need to take some time to dig deeper and understand it. I am hoping to start with the current source code and add a filter.

I read that some folks from Google did some of the work on the fts3 module. They probably drop all tags when they search for text, and that allows for a more reliable search. I'm just guessing.

Thanks again,
Paul

On Tue, Mar 10, 2009 at 11:22 AM, Scott Hess <[email protected]> wrote:
> The fts module doesn't do anything "interesting" with embedded
> meta-data in the interests of simplicity. Stripping the info out
> before inserting is probably easiest, but has the downsides of
> duplication (assuming you need to keep the raw data elsewhere), and it
> means that queries involving snippets and the like may be funky.
> Probably the best way to go about this would be to convert all tags to
> a single whitespace character.
>
> Building a custom tokenizer is certainly doable, but could be a
> frustrating goal if you intend to process generic HTML you find on the
> web, just because of the number of heuristics you'll have to layer in.
> As a first pass, you might just treat tags as word breaks as you
> iterate over the input. But there definitely are cases where HTML
> markup happens within words, so you might need something a bit more
> sophisticated. There is some level of support for returning tokens
> which are not literally present in the input. For instance, for the
> input 't<b>h</b>is' you could return 'this' and indicate that it
> corresponds to 11 characters in the input, and everything should work.
> I'm not sure anyone has ever exercised this aspect of things
> strongly, though, so it's possible that things don't work as intended
> when you do that.
>
> Before going either direction, you should probably sit down and figure
> out what exactly you're going to do with the results you get from the
> table. If you want to, say, present them on a web page, then your
> problems are just beginning, because the tag nesting will open up
> layout issues and security problems. It may be that thinking through
> that part of the system will help you figure out an appropriate
> approach for this part of the system (for instance, if you decide to
> strip tags for other reasons, then it all becomes easy!).
>
> -scott
>
>
> On Tue, Mar 10, 2009 at 6:56 AM, Paul Perry <[email protected]> wrote:
> > Thank you for the pointers Alexandre and Alexey.
> >
> > I spent about 30 minutes looking into the parser, and it looks like it is a
> > possibility. I'll require a more in-depth understanding in order to do
> > this. I would probably start with the simple parser, and go from there.
> >
> >> I think to prepare html before insert is more simple. You can transform
> >> html into "right" format for fts3 parser.
> >
> > I would actually like to retain the tagged (html) formatting in the
> > database, thereby, when it is retrieved it can be displayed as rich text.
> >
> > Thanks,
> > Paul
> >
> > On Tue, Mar 10, 2009 at 4:32 AM, Alexey Pechnikov <[email protected]> wrote:
> >
> >> Hello!
> >>
> >> On Tuesday 10 March 2009 06:16:16 Alexandre Courbot wrote:
> >> > Never did this myself, but I think you can do what you need by writing
> >> > your own tokenizer:
> >> >
> >> > http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
> >>
> >> It's not good advice for a few documented module.
> >>
> >> I think to prepare html before insert is more simple. You can transform
> >> html into "right" format for fts3 parser.
> >>
> >> Best regards.
> >> _______________________________________________
> >> sqlite-users mailing list
> >> [email protected]
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
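
A minimal sketch of the pre-processing idea discussed in this thread: turn tags into whitespace before the text goes into the fts3 table, while the original HTML is stored separately for display. It assumes controlled markup with simple open and close tags, as in the `<p id="style1">` example, and ASCII-only tags. The names `TAG`, `tags_to_space`, and `tags_to_padding` are illustrative and not part of SQLite or fts3. The second function is a variation on the single-space suggestion: it pads each tag with as many spaces as it had characters, so positions in the indexed text still line up with the stored HTML.

```python
import re

# Matches a simple open or close tag such as <p id="style1"> or </p>.
# Assumption: markup is under our control, so '<' and '>' never appear
# inside attribute values or as literal text.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def tags_to_space(text):
    """Replace every tag with a single space (tags act as word breaks)."""
    return TAG.sub(" ", text)


def tags_to_padding(text):
    """Replace every tag with spaces of the same length.

    The result has the same length as the input, so a character offset
    in the indexed text points at the same place in the original HTML.
    """
    return TAG.sub(lambda m: " " * len(m.group(0)), text)


raw = 'Some text. <p id="style1">this is more text.</p> Followed by more.'
indexed = tags_to_padding(raw)  # text to insert into the full-text index
```

Both functions treat a tag as a word break, so markup inside a word (the 't<b>h</b>is' case) would be indexed as separate tokens. That is acceptable only under the stated restriction that no single word is split by tags.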

