Scott,

Thanks so much for your thoughtful response.  Things can get complicated
(especially with full text searching) when using formatted text.

What I really want to do is retain the formatting, to display in an HTML
file.  The text out of the database would be wrapped by a basic outer HTML
document, and the formatted text would mainly reference style sheets.  This
is for a desktop application, so I am not trying to target a browser.

I was thinking it is ok to limit myself to full open and close tags, and
also to not have a single word be broken up, as you illustrated nicely.

So I was thinking that if I just filtered out (or removed) the basic tags I
am using, such as:

Some text. <p id="style1">this is more text.</p>  Followed by more.

by simply not putting the tags into the index (it technically wouldn't be
fully formed XML with a header and such), the index would still retain the
text for searching but ignore the tags.  I agree that trying to handle generic
HTML would be a bear.  The content is not coming from the web, so I have
control over that as well.  I guess there could be problems with snippets or
with the offsets of the text.
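
A minimal sketch of that filtering idea, written in Python rather than against
the fts3 C code.  The helper name `blank_tags` and the regular expression are
assumptions for illustration; the pattern only covers simple, well-formed open
and close tags like the ones above.  It is a variation on replacing each tag
with a single space: padding with one space per character keeps the character
offsets of the remaining text the same as in the stored HTML, which may help
with the offset concern (byte offsets would differ for non-ASCII text).

```python
import re

# Matches simple open/close tags such as <p id="style1"> or </p>.
# Assumption: tags are well formed and no attribute value contains '<' or '>'.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")

def blank_tags(text):
    """Replace every tag with spaces of the same length.

    The visible text keeps its original character offsets, and each tag
    still acts as a word break for a whitespace-based tokenizer.
    """
    return TAG_RE.sub(lambda m: " " * len(m.group(0)), text)

raw = 'Some text. <p id="style1">this is more text.</p>  Followed by more.'
indexed = blank_tags(raw)  # text to hand to the full-text index
```

The raw string would still be stored separately for display, and only
`indexed` would go into the full-text table.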

I'll have to take a closer look into the provided parsers, and maybe there
is a way that any tags could just *not* be put into the index.
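
To illustrate what keeping tags out of the index could look like at the
tokenizer level, here is a rough Python sketch.  It is not the fts3 tokenizer
interface (that is C); the function name `tokenize` and the patterns are
assumptions, and it only handles simple open and close tags that never fall in
the middle of a word.  Tags are treated as word breaks and skipped, and each
word is reported with its start and end offsets in the original text.

```python
import re

# One alternative matches a whole tag, the other a run of letters/digits.
# Assumption: tags are well formed and never split a word.
TOKEN_OR_TAG_RE = re.compile(
    r"(?P<tag></?[A-Za-z][^<>]*>)|(?P<word>[A-Za-z0-9]+)"
)

def tokenize(text):
    """Yield (token, start, end) for each word, skipping tags entirely.

    The offsets refer to the original, still-tagged text.
    """
    for m in TOKEN_OR_TAG_RE.finditer(text):
        if m.lastgroup == "word":
            yield m.group().lower(), m.start(), m.end()

raw = 'Some text. <p id="style1">this is more text.</p>  Followed by more.'
tokens = list(tokenize(raw))
```

Because a tag is consumed as a single match, names inside it (such as `p`,
`id`, or `style1`) never show up as tokens.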

I read the documents pointed out and the readme.  It looks like most of the
info on it is the source code, so I will need to take some time to dig
deeper and understand it.  I am hoping to start with the current source code
and add a filter.

I read that some folks from Google did some of the work on the fts3 module.
They probably drop all tags when they search for text, and that allows for a
more reliable search.  I'm just guessing.

Thanks again,
Paul

On Tue, Mar 10, 2009 at 11:22 AM, Scott Hess <[email protected]> wrote:

> The fts module doesn't do anything "interesting" with embedded
> meta-data in the interests of simplicity.  Stripping the info out
> before inserting is probably easiest, but has the downsides of
> duplication (assuming you need to keep the raw data elsewhere), and it
> means that queries involving snippets and the like may be funky.
> Probably the best way to go about this would be to convert all tags to
> a single whitespace character.
>
> Building a custom tokenizer is certainly doable, but could be a
> frustrating goal if you intend to process generic HTML you find on the
> web, just because of the number of heuristics you'll have to layer in.
>  As a first pass, you might just treat tags as word breaks as you
> iterate over the input.  But there definitely are cases where HTML
> markup happens within words, so you might need something a bit more
> sophisticated.  There is some level of support for returning tokens
> which are not literally present in the input.  For instance, for the
> input 't<b>h</b>is' you could return 'this' and indicate that it
> corresponds to 11 characters in the input, and everything should work.
>  I'm not sure anyone has ever exercised this aspect of things
> strongly, though, so it's possible that things don't work as intended
> when you do that.
>
> Before going either direction, you should probably sit down and figure
> out what exactly you're going to do with the results you get from the
> table.  If you want to, say, present them on a web page, then your
> problems are just beginning, because the tag nesting will open up
> layout issues and security problems.  It may be that thinking through
> that part of the system will help you figure out an appropriate
> approach for this part of the system (for instance, if you decide to
> strip tags for other reasons, then it all becomes easy!).
>
> -scott
>
>
> On Tue, Mar 10, 2009 at 6:56 AM, Paul Perry <[email protected]> wrote:
> > Thank you for the pointers Alexandre and Alexey.
> >
> > I spent about 30 minutes looking into the parser, and it looks like it is a
> > possibility.  I'll require a more in-depth understanding in order to do
> > this.  I would probably start with the simple parser, and go from there.
> >
> >> I think to prepare html before insert is more simple. You can transform
> >> html into "right" format for fts3 parser.
> >
> > I would actually like to retain the tagged (html) formatting in the
> > database, thereby, when it is retrieved it can be displayed as rich text.
> >
> > Thanks,
> > Paul
> >
> >
> >
> > On Tue, Mar 10, 2009 at 4:32 AM, Alexey Pechnikov <[email protected]> wrote:
> >
> >> Hello!
> >>
> >> On Tuesday 10 March 2009 06:16:16 Alexandre Courbot wrote:
> >> > Never did this myself, but I think you can do what you need by writing
> >> > your own tokenizer:
> >> >
> >> > http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/README.tokenizers
> >>
> >> It's not good advice for a poorly documented module.
> >>
> >> I think to prepare html before insert is more simple. You can transform
> >> html
> >> into "right" format for fts3 parser.
> >>
> >> Best regards.
> >>  _______________________________________________
> >> sqlite-users mailing list
> >> [email protected]
> >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >>
