In the interests of simplicity, the fts module doesn't do anything
"interesting" with embedded meta-data.  Stripping the markup out
before inserting is probably easiest, but it has two downsides:
duplication (assuming you need to keep the raw data elsewhere), and
queries involving snippets and the like may be funky, since they run
against the stripped text rather than the original markup.
Probably the best way to go about this would be to convert each tag to
a single whitespace character.
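
A minimal sketch of that tag-to-whitespace conversion, in Python just
for illustration.  The helper names and the regex are my own
assumptions, and the regex is a naive tag matcher, not a real HTML
parser (comments, scripts and a '>' inside an attribute value will
confuse it).  The second function is a variant I'm adding, not
something fts does for you: it pads each tag with as many spaces as it
had characters, so that offsets into the stripped text still line up
with the raw text.

```python
import re

# Naive tag matcher (an assumption for this sketch): anything from '<'
# up to the next '>'.  Real-world HTML needs a proper parser.
_TAG_RE = re.compile(r"<[^>]*>")


def tags_to_space(html_text):
    """Replace every tag with a single space."""
    return _TAG_RE.sub(" ", html_text)


def tags_to_padding(html_text):
    """Replace every tag with spaces of the same length as the tag.

    The result has the same length as the input, so a character offset
    in the stripped text points at the same character in the raw text.
    """
    return _TAG_RE.sub(lambda m: " " * len(m.group(0)), html_text)
```

Note that either version turns 't<b>h</b>is' into three separate
words, which is exactly the markup-within-words case that needs more
care.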

Building a custom tokenizer is certainly doable, but could be a
frustrating goal if you intend to process generic HTML you find on the
web, just because of the number of heuristics you'll have to layer in.
As a first pass, you might just treat tags as word breaks as you
iterate over the input.  But there definitely are cases where HTML
markup happens within words, so you might need something a bit more
sophisticated.  There is some level of support for returning tokens
which are not literally present in the input.  For instance, for the
input 't<b>h</b>is' you could return 'this' and indicate that it
corresponds to 11 characters in the input, and everything should work.
I'm not sure anyone has ever exercised this aspect of things
strongly, though, so it's possible that things don't work as intended
when you do that.
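
To make the idea concrete, here is a rough sketch of the tokenizing
logic in Python.  It is not an fts3 tokenizer (a real one is written
in C against the tokenizer interface described in README.tokenizers);
it only shows how a token can be assembled from pieces around inline
markup while reporting the span it covers in the raw input.  The
INLINE_TAGS set and the rule "inline tags don't break words, every
other tag does" are heuristics I'm assuming for the example, and all
names here are hypothetical.

```python
import re

# Assumption: tags in this set may occur inside a word; every other tag
# is treated as a word break.  The set is illustrative, not exhaustive.
INLINE_TAGS = {"b", "i", "em", "strong", "u", "span"}

# One piece at a time: a tag (group 1 = tag name), a run of word
# characters (group 2), or any other single character (group 3).
_PIECE_RE = re.compile(
    r"<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>|(\w+)|(.)", re.DOTALL
)


def tokenize(text):
    """Yield (token, start, end), with offsets into the raw input."""
    parts = []    # text fragments of the word being built
    start = None  # offset where the current word began
    end = None    # offset just past its last text fragment
    for m in _PIECE_RE.finditer(text):
        tag, word = m.group(1), m.group(2)
        if word is not None:
            if start is None:
                start = m.start()
            parts.append(word)
            end = m.end()
        elif tag is not None and tag.lower() in INLINE_TAGS:
            # Inline markup: keep the current word open.
            continue
        else:
            # Any other tag or character ends the current word.
            if parts:
                yield "".join(parts), start, end
            parts, start, end = [], None, None
    if parts:
        yield "".join(parts), start, end
```

With this, list(tokenize('t<b>h</b>is')) gives [('this', 0, 11)]: the
token 'this' never appears literally in the input, but it is reported
as covering all 11 characters.  A block-level tag such as <p> still
acts as a word break.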

Before going either direction, you should probably sit down and figure
out what exactly you're going to do with the results you get from the
table.  If you want to, say, present them on a web page, then your
problems are just beginning, because the tag nesting will open up
layout issues and security problems.  It may be that thinking through
that part of the system will help you figure out an appropriate
approach for this part of the system (for instance, if you decide to
strip tags for other reasons, then it all becomes easy!).

-scott


On Tue, Mar 10, 2009 at 6:56 AM, Paul Perry <[email protected]> wrote:
> Thank you for the pointers Alexandre and Alexey.
>
> I spent about 30 minutes looking into the parser, and it looks like it is a
> possibility.  I'll require a more in-depth understanding in order to do
> this.  I would probably start with the simple parser, and go from there.
>
>> I think to prepare html before insert is more simple. You can transform
>> html into "right" format for fts3 parser.
>
> I would actually like to retain the tagged (html) formatting in the
> database, thereby, when it is retrieved it can be displayed as rich text.
>
> Thanks,
> Paul
>
>
>
> On Tue, Mar 10, 2009 at 4:32 AM, Alexey Pechnikov <[email protected]> wrote:
>
>> Hello!
>>
>> On Tuesday 10 March 2009 06:16:16 Alexandre Courbot wrote:
>> > Never did this myself, but I think you can do what you need by writing
>> > your own tokenizer:
>> >
>> >
>> > http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
>>
>> That's not good advice for a poorly documented module.
>>
>> I think to prepare html before insert is more simple. You can transform
>> html into "right" format for fts3 parser.
>>
>> Best regards.
>>  _______________________________________________
>> sqlite-users mailing list
>> [email protected]
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>