Re: [sqlite] FTS2 suggestion

2007-08-29 Thread brian kruse
On 8/29/07, Scott Hess <[EMAIL PROTECTED]> wrote:
> What was fts3 will now be fts4.  fts3 will now be
> fts2-with-rowid-fixed.  fts3 is already in the tree, but with an
> #error at the top to force people to not use it without reading a
> comment.  I was planning to turn that off this week (what with the
> SQLite 3.5 stuff going on, might as well!).

Thanks very much for the prompt and detailed update!

-B

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS2 suggestion

2007-08-29 Thread Scott Hess
Hmm, and a clarification on the n-gram case ... there are no current
plans to implement any n-gram capabilities in fts.  This kind of thing
has been discussed, but since it still seems like a nice-to-have type
thing and not a must-have type thing, no time is being spent on it.  I
have somewhat of a suspicion that this kind of index requires a
materially different model than fts has been using, which might
encourage it to be a completely different virtual table.

-scott


On 8/29/07, Scott Hess <[EMAIL PROTECTED]> wrote:
> A primary constraint of the porter algorithm in fts is that it's
> completely unencumbered open-source.  That may-or-may-not make it a
> great stemmer, of course :-).  One of the reasons it's in there in the
> first place is as an example of an alternative to the very basic
> "simple" fts tokenizer.  One of the near-term goals with Google Gears
> is to improve the tokenizer, and that will probably extend benefits
> out to fts (since Google Gears is also open-source).
>
> Thanks for the link, I'm always looking for reading material!
>
> As far as SQLite having inbuilt search, some projects (Google Gears,
> for example) wanted to use SQLite for reasons other than fulltext
> search.  Rather than try to integrate two distinct projects, we
> decided that it might be cleaner to just make one project a strict
> subsidiary of the other.  So you get fts basically for free once
> you've integrated SQLite into your project.  A side benefit is that
> you don't have to make decisions about where to store your index data,
> and there are no problems with making sure index data and database
> data conform to the same transaction model, these things just happen
> naturally.  This will hopefully make fulltext search more applicable
> in projects where searching is not the core functionality of the
> project.
>
> -scott
>
>
> On 8/29/07, Uma Krishnan <[EMAIL PROTECTED]> wrote:
> > Hello Scott,
> >
> > I have several clarifications with respect to full text search. I'm a 
> > newbie in open source development, so please bear with me if some of the 
> > questions are irrelevant/obvious/nonsense.
> >
> > I was given to understand that the potter stemming algorithm implemented in 
> > fts2 is not robust enough (or rather snowball is more accurate). If fts2(or 
> > 3) has to be made more robust, then what should be the next step. The 
> > following url (I thought) gave the steps to follow rather succinctly:
> >
> > http://web.njit.edu/~wu/teaching/CIS634/GoodProjects/AccessLisa/documentation.php
> >
> > At what stage would n-gram kick in (I assume n-gram would be in conjunction 
> > to snowball/potter). Which would be a good n-gram algorithm to implement.
> >
> > Finally, what's the rationale in having sqlite's own search. Why not use 
> > something like luceneC?
> >
> > Thanks in advance
> >
> > Uma
> >
> > Scott Hess <[EMAIL PROTECTED]> wrote: Porter stemmer is already in there.  
> > The main issue with Porter is
> > that it's English only.
> >
> > There is no general game-plan for fuzzy search at this time, though if
> > someone wants to step into the breech, go for it!  Even a prototype
> > which demonstrates the concepts and problems but isn't
> > production-ready would be worth something.
> >
> > My current focus for the next generation is international support
> > (this is more of a Google Gears project, but with focus on SQLite so
> > there is likely to be stuff checked in on the SQLite side), and more
> > scalable/manageable indexing.  Not a lot of focus on things like
> > quality and recall, mostly because I'm not aware of any major users
> > with enough of an installed baseline to even generate decent metrics.
> > [Basically, solving concrete identified problems rather than looking
> > for ill-defined potential problems.]
> >
> > -scott
> >
> >
> > On 8/24/07, Uma Krishnan  wrote:
> > > Would it not be more useful to first implement potter stemmer algorithm, 
> > > and then to implement n-gram (as I understand n-gram is for cross column 
> > > fuzzy search?). What is the general game plan for FTS3 with regard to 
> > > fuzzy search?
> > >
> > >   Thanks in advance
> > >
> > > "Cesar D. Rodas"  wrote:
> > >   On 23/08/07, Scott Hess wrote:
> > > > On 8/20/07, Cesar D. Rodas wrote:
> > > > > As I know ( I can be wrong ) SQLite Full Text Search is only match 
> > > > > with hole
> > > > > words right? It could not be
> > > > > And also no FT extension to db ( as far I know) is miss spell 
> > > > > tolerant,
> > > >
> > > > Yes, fts is matching exactly. There is some primitive support for
> > > > English stemming using the Porter stemmer, but, honestly, it's not
> > > > well-exercised.
> > > >
> > > > > And
> > > > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > > > N-Gram
> > > > > Lists For Efficient Inexact Matching*
> > > >
> > > > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > 

Re: [sqlite] FTS2 suggestion

2007-08-29 Thread Scott Hess
What was fts3 will now be fts4.  fts3 will now be
fts2-with-rowid-fixed.  fts3 is already in the tree, but with an
#error at the top to force people to not use it without reading a
comment.  I was planning to turn that off this week (what with the
SQLite 3.5 stuff going on, might as well!).

The next generation of fts has been 6 weeks out for ... the entire
year.  Sigh.  At this time it's my highest priority, though, and I'm
not really supposed to be working on anything else, so I'm hopeful
that there will be substantial code checked in by the end of
September.  Going by the fts2 experience, it will probably need 2 or 3
weeks beyond that to really settle into a usable state.

-scott


On 8/29/07, brian kruse <[EMAIL PROTECTED]> wrote:
> On 8/24/07, Scott Hess <[EMAIL PROTECTED]> wrote:
> >
> > My current focus for the next generation is international support
> > (this is more of a Google Gears project, but with focus on SQLite so
> > there is likely to be stuff checked in on the SQLite side), and more
> > scalable/manageable indexing.
>
> Thanks for the update Scott. Given that FTS3 will also presumably fix
> the VACUUM FTS1/2 bug, do you have a timeline for the FTS3 release?
>
> Even a estimate with a +/- 30 day granularity would be nice.
>
> Kind regards,
> -B
>
> Not a lot of focus on things like
> > quality and recall, mostly because I'm not aware of any major users
> > with enough of an installed baseline to even generate decent metrics.
> > [Basically, solving concrete identified problems rather than looking
> > for ill-defined potential problems.]
> >
> > -scott
> >
> >
> > On 8/24/07, Uma Krishnan <[EMAIL PROTECTED]> wrote:
> > > Would it not be more useful to first implement potter stemmer algorithm, 
> > > and then to implement n-gram (as I understand n-gram is for cross column 
> > > fuzzy search?). What is the general game plan for FTS3 with regard to 
> > > fuzzy search?
> > >
> > >   Thanks in advance
> > >
> > > "Cesar D. Rodas" <[EMAIL PROTECTED]> wrote:
> > >   On 23/08/07, Scott Hess wrote:
> > > > On 8/20/07, Cesar D. Rodas wrote:
> > > > > As I know ( I can be wrong ) SQLite Full Text Search is only match 
> > > > > with hole
> > > > > words right? It could not be
> > > > > And also no FT extension to db ( as far I know) is miss spell 
> > > > > tolerant,
> > > >
> > > > Yes, fts is matching exactly. There is some primitive support for
> > > > English stemming using the Porter stemmer, but, honestly, it's not
> > > > well-exercised.
> > > >
> > > > > And
> > > > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > > > N-Gram
> > > > > Lists For Efficient Inexact Matching*
> > > >
> > > > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > > >
> > > > > I was reading and it is not so hard to implement, but it cost a extra
> > > > > storage space, but I think the benefits are more.
> > > > >
> > > > > Also following this paper could be done a way to match with fragments 
> > > > > of
> > > > > words... what do you think of it?
> > > >
> > > > It's an interesting paper, and I must say that anything which involves
> > > > Bloom Filters automatically draws my attention :-).
> > >
> > > Yeah. I am doing some investigations about that, I love that too. And
> > > I was watching that with n-grams you get a filter to stop common
> > > words, and could be used as a stemming-like algorithm but independent
> > > from the language.
> > >
> > > I was thinking to implement this
> > > http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> > > when I finish up some things. What do you think of it?
> > >
> > > > While I think spelling-suggestion might be valuable for fts in the
> > > > longer term, I'm not very enthusiastic about this particular model.
> > > > It seems much more useful in the standard indexing model of building
> > > > the index, manually tweaking it, and then doing a ton of queries
> > > > against it. fts is really fairly constrained, because many use-cases
> > > > are more along the lines of update the index quite a bit, and query it
> > > > only a few times.
> > > >
> > > > Also, I think the concepts in the paper might have very significant
> > > > problems handling Unicode, because the bit vectors will get so very
> > > > large. I may be wrong, sometimes the overlapping-vector approach can
> > > > have surprising relevance depending on the frequency distribution of
> > > > the things in the vector. It would need some experimentation to
> > > > figure that out.
> > > >
> > > > Certainly something to bookmark, though.
> > > >
> > > > Thanks,
> > > > scott
> > > >
> > > > -
> > > > To unsubscribe, send email to [EMAIL PROTECTED]
> > > > -
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Cesar D. Rodas
> > > 

Re: [sqlite] FTS2 suggestion

2007-08-29 Thread Scott Hess
A primary constraint of the porter algorithm in fts is that it's
completely unencumbered open-source.  That may-or-may-not make it a
great stemmer, of course :-).  One of the reasons it's in there in the
first place is as an example of an alternative to the very basic
"simple" fts tokenizer.  One of the near-term goals with Google Gears
is to improve the tokenizer, and that will probably extend benefits
out to fts (since Google Gears is also open-source).

Thanks for the link, I'm always looking for reading material!

As far as SQLite having inbuilt search, some projects (Google Gears,
for example) wanted to use SQLite for reasons other than fulltext
search.  Rather than try to integrate two distinct projects, we
decided that it might be cleaner to just make one project a strict
subsidiary of the other.  So you get fts basically for free once
you've integrated SQLite into your project.  A side benefit is that
you don't have to make decisions about where to store your index data,
and there are no problems with making sure index data and database
data conform to the same transaction model, these things just happen
naturally.  This will hopefully make fulltext search more applicable
in projects where searching is not the core functionality of the
project.

-scott


On 8/29/07, Uma Krishnan <[EMAIL PROTECTED]> wrote:
> Hello Scott,
>
> I have several clarifications with respect to full text search. I'm a newbie 
> in open source development, so please bear with me if some of the questions 
> are irrelevant/obvious/nonsense.
>
> I was given to understand that the potter stemming algorithm implemented in 
> fts2 is not robust enough (or rather snowball is more accurate). If fts2(or 
> 3) has to be made more robust, then what should be the next step. The 
> following url (I thought) gave the steps to follow rather succinctly:
>
> http://web.njit.edu/~wu/teaching/CIS634/GoodProjects/AccessLisa/documentation.php
>
> At what stage would n-gram kick in (I assume n-gram would be in conjunction 
> to snowball/potter). Which would be a good n-gram algorithm to implement.
>
> Finally, what's the rationale in having sqlite's own search. Why not use 
> something like luceneC?
>
> Thanks in advance
>
> Uma
>
> Scott Hess <[EMAIL PROTECTED]> wrote: Porter stemmer is already in there.  
> The main issue with Porter is
> that it's English only.
>
> There is no general game-plan for fuzzy search at this time, though if
> someone wants to step into the breech, go for it!  Even a prototype
> which demonstrates the concepts and problems but isn't
> production-ready would be worth something.
>
> My current focus for the next generation is international support
> (this is more of a Google Gears project, but with focus on SQLite so
> there is likely to be stuff checked in on the SQLite side), and more
> scalable/manageable indexing.  Not a lot of focus on things like
> quality and recall, mostly because I'm not aware of any major users
> with enough of an installed baseline to even generate decent metrics.
> [Basically, solving concrete identified problems rather than looking
> for ill-defined potential problems.]
>
> -scott
>
>
> On 8/24/07, Uma Krishnan  wrote:
> > Would it not be more useful to first implement potter stemmer algorithm, 
> > and then to implement n-gram (as I understand n-gram is for cross column 
> > fuzzy search?). What is the general game plan for FTS3 with regard to fuzzy 
> > search?
> >
> >   Thanks in advance
> >
> > "Cesar D. Rodas"  wrote:
> >   On 23/08/07, Scott Hess wrote:
> > > On 8/20/07, Cesar D. Rodas wrote:
> > > > As I know ( I can be wrong ) SQLite Full Text Search is only match with 
> > > > hole
> > > > words right? It could not be
> > > > And also no FT extension to db ( as far I know) is miss spell tolerant,
> > >
> > > Yes, fts is matching exactly. There is some primitive support for
> > > English stemming using the Porter stemmer, but, honestly, it's not
> > > well-exercised.
> > >
> > > > And
> > > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > > N-Gram
> > > > Lists For Efficient Inexact Matching*
> > >
> > > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > >
> > > > I was reading and it is not so hard to implement, but it cost a extra
> > > > storage space, but I think the benefits are more.
> > > >
> > > > Also following this paper could be done a way to match with fragments of
> > > > words... what do you think of it?
> > >
> > > It's an interesting paper, and I must say that anything which involves
> > > Bloom Filters automatically draws my attention :-).
> >
> > Yeah. I am doing some investigations about that, I love that too. And
> > I was watching that with n-grams you get a filter to stop common
> > words, and could be used as a stemming-like algorithm but independent
> > from the language.
> >
> > I was thinking to implement this
> > 

Re: [sqlite] FTS2 suggestion

2007-08-29 Thread brian kruse
On 8/24/07, Scott Hess <[EMAIL PROTECTED]> wrote:
>
> My current focus for the next generation is international support
> (this is more of a Google Gears project, but with focus on SQLite so
> there is likely to be stuff checked in on the SQLite side), and more
> scalable/manageable indexing.

Thanks for the update Scott. Given that FTS3 will also presumably fix
the VACUUM FTS1/2 bug, do you have a timeline for the FTS3 release?

Even a estimate with a +/- 30 day granularity would be nice.

Kind regards,
-B

Not a lot of focus on things like
> quality and recall, mostly because I'm not aware of any major users
> with enough of an installed baseline to even generate decent metrics.
> [Basically, solving concrete identified problems rather than looking
> for ill-defined potential problems.]
>
> -scott
>
>
> On 8/24/07, Uma Krishnan <[EMAIL PROTECTED]> wrote:
> > Would it not be more useful to first implement potter stemmer algorithm, 
> > and then to implement n-gram (as I understand n-gram is for cross column 
> > fuzzy search?). What is the general game plan for FTS3 with regard to fuzzy 
> > search?
> >
> >   Thanks in advance
> >
> > "Cesar D. Rodas" <[EMAIL PROTECTED]> wrote:
> >   On 23/08/07, Scott Hess wrote:
> > > On 8/20/07, Cesar D. Rodas wrote:
> > > > As I know ( I can be wrong ) SQLite Full Text Search is only match with 
> > > > hole
> > > > words right? It could not be
> > > > And also no FT extension to db ( as far I know) is miss spell tolerant,
> > >
> > > Yes, fts is matching exactly. There is some primitive support for
> > > English stemming using the Porter stemmer, but, honestly, it's not
> > > well-exercised.
> > >
> > > > And
> > > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > > N-Gram
> > > > Lists For Efficient Inexact Matching*
> > >
> > > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > >
> > > > I was reading and it is not so hard to implement, but it cost a extra
> > > > storage space, but I think the benefits are more.
> > > >
> > > > Also following this paper could be done a way to match with fragments of
> > > > words... what do you think of it?
> > >
> > > It's an interesting paper, and I must say that anything which involves
> > > Bloom Filters automatically draws my attention :-).
> >
> > Yeah. I am doing some investigations about that, I love that too. And
> > I was watching that with n-grams you get a filter to stop common
> > words, and could be used as a stemming-like algorithm but independent
> > from the language.
> >
> > I was thinking to implement this
> > http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> > when I finish up some things. What do you think of it?
> >
> > > While I think spelling-suggestion might be valuable for fts in the
> > > longer term, I'm not very enthusiastic about this particular model.
> > > It seems much more useful in the standard indexing model of building
> > > the index, manually tweaking it, and then doing a ton of queries
> > > against it. fts is really fairly constrained, because many use-cases
> > > are more along the lines of update the index quite a bit, and query it
> > > only a few times.
> > >
> > > Also, I think the concepts in the paper might have very significant
> > > problems handling Unicode, because the bit vectors will get so very
> > > large. I may be wrong, sometimes the overlapping-vector approach can
> > > have surprising relevance depending on the frequency distribution of
> > > the things in the vector. It would need some experimentation to
> > > figure that out.
> > >
> > > Certainly something to bookmark, though.
> > >
> > > Thanks,
> > > scott
> > >
> > > -
> > > To unsubscribe, send email to [EMAIL PROTECTED]
> > > -
> > >
> > >
> >
> >
> >
> > --
> > Cesar D. Rodas
> > http://www.cesarodas.com/
> > Mobile Phone: 595 961 974165
> > Phone: 595 21 645590
> > [EMAIL PROTECTED]
> > [EMAIL PROTECTED]
> >
> > -
> > To unsubscribe, send email to [EMAIL PROTECTED]
> > -
> >
> >
> >
>
> -
> To unsubscribe, send email to [EMAIL PROTECTED]
> -
>
>

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS2 suggestion

2007-08-29 Thread Cesar D. Rodas
N-gram is a sequense of N Letters of a word or set of words...

http://en.wikipedia.org/wiki/N-gram



On 29/08/2007, Uma Krishnan <[EMAIL PROTECTED]> wrote:
>
> Hello Scott,
>
> I have several clarifications with respect to full text search. I'm a
> newbie in open source development, so please bear with me if some of the
> questions are irrelevant/obvious/nonsense.
>
> I was given to understand that the potter stemming algorithm implemented
> in fts2 is not robust enough (or rather snowball is more accurate). If
> fts2(or 3) has to be made more robust, then what should be the next step.
> The following url (I thought) gave the steps to follow rather succinctly:
>
>
> http://web.njit.edu/~wu/teaching/CIS634/GoodProjects/AccessLisa/documentation.php
>
> At what stage would n-gram kick in (I assume n-gram would be in
> conjunction to snowball/potter). Which would be a good n-gram algorithm to
> implement.
>
> Finally, what's the rationale in having sqlite's own search. Why not use
> something like luceneC?
>
> Thanks in advance
>
> Uma
>
> Scott Hess <[EMAIL PROTECTED]> wrote: Porter stemmer is already in
> there.  The main issue with Porter is
> that it's English only.
>
> There is no general game-plan for fuzzy search at this time, though if
> someone wants to step into the breech, go for it!  Even a prototype
> which demonstrates the concepts and problems but isn't
> production-ready would be worth something.
>
> My current focus for the next generation is international support
> (this is more of a Google Gears project, but with focus on SQLite so
> there is likely to be stuff checked in on the SQLite side), and more
> scalable/manageable indexing.  Not a lot of focus on things like
> quality and recall, mostly because I'm not aware of any major users
> with enough of an installed baseline to even generate decent metrics.
> [Basically, solving concrete identified problems rather than looking
> for ill-defined potential problems.]
>
> -scott
>
>
> On 8/24/07, Uma Krishnan  wrote:
> > Would it not be more useful to first implement potter stemmer algorithm,
> and then to implement n-gram (as I understand n-gram is for cross column
> fuzzy search?). What is the general game plan for FTS3 with regard to fuzzy
> search?
> >
> >   Thanks in advance
> >
> > "Cesar D. Rodas"  wrote:
> >   On 23/08/07, Scott Hess wrote:
> > > On 8/20/07, Cesar D. Rodas wrote:
> > > > As I know ( I can be wrong ) SQLite Full Text Search is only match
> with hole
> > > > words right? It could not be
> > > > And also no FT extension to db ( as far I know) is miss spell
> tolerant,
> > >
> > > Yes, fts is matching exactly. There is some primitive support for
> > > English stemming using the Porter stemmer, but, honestly, it's not
> > > well-exercised.
> > >
> > > > And
> > > > I've found this Paper that talks about *Using Superimposed Coding Of
> N-Gram
> > > > Lists For Efficient Inexact Matching*
> > >
> > >
> http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > > >
> > > > I was reading and it is not so hard to implement, but it cost a
> extra
> > > > storage space, but I think the benefits are more.
> > > >
> > > > Also following this paper could be done a way to match with
> fragments of
> > > > words... what do you think of it?
> > >
> > > It's an interesting paper, and I must say that anything which involves
> > > Bloom Filters automatically draws my attention :-).
> >
> > Yeah. I am doing some investigations about that, I love that too. And
> > I was watching that with n-grams you get a filter to stop common
> > words, and could be used as a stemming-like algorithm but independent
> > from the language.
> >
> > I was thinking to implement this
> > http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> > when I finish up some things. What do you think of it?
> >
> > > While I think spelling-suggestion might be valuable for fts in the
> > > longer term, I'm not very enthusiastic about this particular model.
> > > It seems much more useful in the standard indexing model of building
> > > the index, manually tweaking it, and then doing a ton of queries
> > > against it. fts is really fairly constrained, because many use-cases
> > > are more along the lines of update the index quite a bit, and query it
> > > only a few times.
> > >
> > > Also, I think the concepts in the paper might have very significant
> > > problems handling Unicode, because the bit vectors will get so very
> > > large. I may be wrong, sometimes the overlapping-vector approach can
> > > have surprising relevance depending on the frequency distribution of
> > > the things in the vector. It would need some experimentation to
> > > figure that out.
> > >
> > > Certainly something to bookmark, though.
> > >
> > > Thanks,
> > > scott
> > >
> > >
> -
> > > To unsubscribe, send email to [EMAIL 

Re: [sqlite] FTS2 suggestion

2007-08-29 Thread Uma Krishnan
Hello Scott,

I have several clarifications with respect to full text search. I'm a newbie in 
open source development, so please bear with me if some of the questions are 
irrelevant/obvious/nonsense.

I was given to understand that the potter stemming algorithm implemented in 
fts2 is not robust enough (or rather snowball is more accurate). If fts2(or 3) 
has to be made more robust, then what should be the next step. The following 
url (I thought) gave the steps to follow rather succinctly:

http://web.njit.edu/~wu/teaching/CIS634/GoodProjects/AccessLisa/documentation.php

At what stage would n-gram kick in (I assume n-gram would be in conjunction to 
snowball/potter). Which would be a good n-gram algorithm to implement.

Finally, what's the rationale in having sqlite's own search. Why not use 
something like luceneC?

Thanks in advance

Uma

Scott Hess <[EMAIL PROTECTED]> wrote: Porter stemmer is already in there.  The 
main issue with Porter is
that it's English only.

There is no general game-plan for fuzzy search at this time, though if
someone wants to step into the breech, go for it!  Even a prototype
which demonstrates the concepts and problems but isn't
production-ready would be worth something.

My current focus for the next generation is international support
(this is more of a Google Gears project, but with focus on SQLite so
there is likely to be stuff checked in on the SQLite side), and more
scalable/manageable indexing.  Not a lot of focus on things like
quality and recall, mostly because I'm not aware of any major users
with enough of an installed baseline to even generate decent metrics.
[Basically, solving concrete identified problems rather than looking
for ill-defined potential problems.]

-scott


On 8/24/07, Uma Krishnan  wrote:
> Would it not be more useful to first implement potter stemmer algorithm, and 
> then to implement n-gram (as I understand n-gram is for cross column fuzzy 
> search?). What is the general game plan for FTS3 with regard to fuzzy search?
>
>   Thanks in advance
>
> "Cesar D. Rodas"  wrote:
>   On 23/08/07, Scott Hess wrote:
> > On 8/20/07, Cesar D. Rodas wrote:
> > > As I know ( I can be wrong ) SQLite Full Text Search is only match with 
> > > hole
> > > words right? It could not be
> > > And also no FT extension to db ( as far I know) is miss spell tolerant,
> >
> > Yes, fts is matching exactly. There is some primitive support for
> > English stemming using the Porter stemmer, but, honestly, it's not
> > well-exercised.
> >
> > > And
> > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > N-Gram
> > > Lists For Efficient Inexact Matching*
> >
> > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > >
> > > I was reading and it is not so hard to implement, but it cost a extra
> > > storage space, but I think the benefits are more.
> > >
> > > Also following this paper could be done a way to match with fragments of
> > > words... what do you think of it?
> >
> > It's an interesting paper, and I must say that anything which involves
> > Bloom Filters automatically draws my attention :-).
>
> Yeah. I am doing some investigations about that, I love that too. And
> I was watching that with n-grams you get a filter to stop common
> words, and could be used as a stemming-like algorithm but independent
> from the language.
>
> I was thinking to implement this
> http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> when I finish up some things. What do you think of it?
>
> > While I think spelling-suggestion might be valuable for fts in the
> > longer term, I'm not very enthusiastic about this particular model.
> > It seems much more useful in the standard indexing model of building
> > the index, manually tweaking it, and then doing a ton of queries
> > against it. fts is really fairly constrained, because many use-cases
> > are more along the lines of update the index quite a bit, and query it
> > only a few times.
> >
> > Also, I think the concepts in the paper might have very significant
> > problems handling Unicode, because the bit vectors will get so very
> > large. I may be wrong, sometimes the overlapping-vector approach can
> > have surprising relevance depending on the frequency distribution of
> > the things in the vector. It would need some experimentation to
> > figure that out.
> >
> > Certainly something to bookmark, though.
> >
> > Thanks,
> > scott
> >
> > -
> > To unsubscribe, send email to [EMAIL PROTECTED]
> > -
> >
> >
>
>
>
> --
> Cesar D. Rodas
> http://www.cesarodas.com/
> Mobile Phone: 595 961 974165
> Phone: 595 21 645590
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
>
> -
> To unsubscribe, send 

Re: [sqlite] FTS2 suggestion

2007-08-24 Thread Scott Hess
Porter stemmer is already in there.  The main issue with Porter is
that it's English only.

There is no general game-plan for fuzzy search at this time, though if
someone wants to step into the breech, go for it!  Even a prototype
which demonstrates the concepts and problems but isn't
production-ready would be worth something.

My current focus for the next generation is international support
(this is more of a Google Gears project, but with focus on SQLite so
there is likely to be stuff checked in on the SQLite side), and more
scalable/manageable indexing.  Not a lot of focus on things like
quality and recall, mostly because I'm not aware of any major users
with enough of an installed baseline to even generate decent metrics.
[Basically, solving concrete identified problems rather than looking
for ill-defined potential problems.]

-scott


On 8/24/07, Uma Krishnan <[EMAIL PROTECTED]> wrote:
> Would it not be more useful to first implement potter stemmer algorithm, and 
> then to implement n-gram (as I understand n-gram is for cross column fuzzy 
> search?). What is the general game plan for FTS3 with regard to fuzzy search?
>
>   Thanks in advance
>
> "Cesar D. Rodas" <[EMAIL PROTECTED]> wrote:
>   On 23/08/07, Scott Hess wrote:
> > On 8/20/07, Cesar D. Rodas wrote:
> > > As I know ( I can be wrong ) SQLite Full Text Search is only match with 
> > > hole
> > > words right? It could not be
> > > And also no FT extension to db ( as far I know) is miss spell tolerant,
> >
> > Yes, fts is matching exactly. There is some primitive support for
> > English stemming using the Porter stemmer, but, honestly, it's not
> > well-exercised.
> >
> > > And
> > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > N-Gram
> > > Lists For Efficient Inexact Matching*
> >
> > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > >
> > > I was reading and it is not so hard to implement, but it cost a extra
> > > storage space, but I think the benefits are more.
> > >
> > > Also following this paper could be done a way to match with fragments of
> > > words... what do you think of it?
> >
> > It's an interesting paper, and I must say that anything which involves
> > Bloom Filters automatically draws my attention :-).
>
> Yeah. I am doing some investigations about that, I love that too. And
> I was watching that with n-grams you get a filter to stop common
> words, and could be used as a stemming-like algorithm but independent
> from the language.
>
> I was thinking to implement this
> http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> when I finish up some things. What do you think of it?
>
> > While I think spelling-suggestion might be valuable for fts in the
> > longer term, I'm not very enthusiastic about this particular model.
> > It seems much more useful in the standard indexing model of building
> > the index, manually tweaking it, and then doing a ton of queries
> > against it. fts is really fairly constrained, because many use-cases
> > are more along the lines of update the index quite a bit, and query it
> > only a few times.
> >
> > Also, I think the concepts in the paper might have very significant
> > problems handling Unicode, because the bit vectors will get so very
> > large. I may be wrong, sometimes the overlapping-vector approach can
> > have surprising relevance depending on the frequency distribution of
> > the things in the vector. It would need some experimentation to
> > figure that out.
> >
> > Certainly something to bookmark, though.
> >
> > Thanks,
> > scott
> >
> > -
> > To unsubscribe, send email to [EMAIL PROTECTED]
> > -
> >
> >
>
>
>
> --
> Cesar D. Rodas
> http://www.cesarodas.com/
> Mobile Phone: 595 961 974165
> Phone: 595 21 645590
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
>
> -
> To unsubscribe, send email to [EMAIL PROTECTED]
> -
>
>
>

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS2 suggestion

2007-08-24 Thread Uma Krishnan
Would it not be more useful to first implement potter stemmer algorithm, and 
then to implement n-gram (as I understand n-gram is for cross column fuzzy 
search?). What is the general game plan for FTS3 with regard to fuzzy search?
   
  Thanks in advance

"Cesar D. Rodas" <[EMAIL PROTECTED]> wrote:
  On 23/08/07, Scott Hess wrote:
> On 8/20/07, Cesar D. Rodas wrote:
> > As I know ( I can be wrong ) SQLite Full Text Search is only match with hole
> > words right? It could not be
> > And also no FT extension to db ( as far I know) is miss spell tolerant,
>
> Yes, fts is matching exactly. There is some primitive support for
> English stemming using the Porter stemmer, but, honestly, it's not
> well-exercised.
>
> > And
> > I've found this Paper that talks about *Using Superimposed Coding Of N-Gram
> > Lists For Efficient Inexact Matching*
>
> http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> >
> > I was reading and it is not so hard to implement, but it cost a extra
> > storage space, but I think the benefits are more.
> >
> > Also following this paper could be done a way to match with fragments of
> > words... what do you think of it?
>
> It's an interesting paper, and I must say that anything which involves
> Bloom Filters automatically draws my attention :-).

Yeah. I am doing some investigations about that, I love that too. And
I was watching that with n-grams you get a filter to stop common
words, and could be used as a stemming-like algorithm but independent
from the language.

I was thinking to implement this
http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
when I finish up some things. What do you think of it?

> While I think spelling-suggestion might be valuable for fts in the
> longer term, I'm not very enthusiastic about this particular model.
> It seems much more useful in the standard indexing model of building
> the index, manually tweaking it, and then doing a ton of queries
> against it. fts is really fairly constrained, because many use-cases
> are more along the lines of update the index quite a bit, and query it
> only a few times.
>
> Also, I think the concepts in the paper might have very significant
> problems handling Unicode, because the bit vectors will get so very
> large. I may be wrong, sometimes the overlapping-vector approach can
> have surprising relevance depending on the frequency distribution of
> the things in the vector. It would need some experimentation to
> figure that out.
>
> Certainly something to bookmark, though.
>
> Thanks,
> scott
>
> -
> To unsubscribe, send email to [EMAIL PROTECTED]
> -
>
>



-- 
Cesar D. Rodas
http://www.cesarodas.com/
Mobile Phone: 595 961 974165
Phone: 595 21 645590
[EMAIL PROTECTED]
[EMAIL PROTECTED]

-
To unsubscribe, send email to [EMAIL PROTECTED]
-




Re: [sqlite] FTS2 suggestion

2007-08-23 Thread Cesar D. Rodas
I

On 23/08/07, Russell Leighton <[EMAIL PROTECTED]> wrote:
>
>
> Could fts3 (the next fts) have the option to override the default
> 'match' function with one passed in (similar to the tokenizer)?
>
> The reason I ask is then the fts table could be used as smart index
> when the tokenizer is
> something like bigram, trigram, etc. and the 'match' function computes
> a similarity metric
> and returns the row if above a threshold.
>
> Postgres does this when you declare an index of type trigram, see:
>
> http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm
>
> Since SQLite does not allow 'plug-in' indexes, the idea would be to
> create an fts3 table with a key back to the main table and the string
> column you want index.
> Indexing becomes a join through the fts3 table.
>
> You would probably want to allow the user to pass args to the 'match'
> function so a threshold could be set to non-default values and maybe
> tweak matching options
> specific to the match and tokenization.
>
> Thoughts?


I think this idea is great... If the ft3 has this optionality i could
rewrite the match function, I like the idea to give the possibility that
users can training  with data, and in database is where most data  are
store, and usually by categories, tags, or other system.

My goal is to give a set of data for learn, then in new inserts assign the
correct or closest category. And another feature that I want is that it
could learn about its mistakes (human assisted)

On Aug 23, 2007, at 4:56 PM, Scott Hess wrote:
>
> > On 8/20/07, Cesar D. Rodas <[EMAIL PROTECTED]> wrote:
> >> As I know ( I can be wrong ) SQLite Full Text Search is only match
> >> with hole
> >> words right? It could not be
> >> And also no FT extension to db ( as far I know) is miss spell
> >> tolerant,
> >
> > Yes, fts is matching exactly.  There is some primitive support for
> > English stemming using the Porter stemmer, but, honestly, it's not
> > well-exercised.
> >
> >> And
> >> I've found this Paper that talks about *Using Superimposed Coding Of
> >> N-Gram
> >> Lists For Efficient Inexact Matching*
> >
> > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:
> > zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/
> > william92using.pdf
> >>
> >> I was reading and it is not so hard to implement, but it cost a extra
> >> storage space, but I think the benefits are more.
> >>
> >> Also following this paper could be done a way to match with fragments
> >> of
> >> words... what do you think of it?
> >
> > It's an interesting paper, and I must say that anything which involves
> > Bloom Filters automatically draws my attention :-).
> >
> > While I think spelling-suggestion might be valuable for fts in the
> > longer term, I'm not very enthusiastic about this particular model.
> > It seems much more useful in the standard indexing model of building
> > the index, manually tweaking it, and then doing a ton of queries
> > against it.  fts is really fairly constrained, because many use-cases
> > are more along the lines of update the index quite a bit, and query it
> > only a few times.
> >
> > Also, I think the concepts in the paper might have very significant
> > problems handling Unicode, because the bit vectors will get so very
> > large.  I may be wrong, sometimes the overlapping-vector approach can
> > have surprising relevance depending on the frequency distribution of
> > the things in the vector.  It would need some experimentation to
> > figure that out.
> >
> > Certainly something to bookmark, though.
> >
> > Thanks,
> > scott
> >
> > ---
> > --
> > To unsubscribe, send email to [EMAIL PROTECTED]
> > ---
> > --
> >
>
>
>
> -
> To unsubscribe, send email to [EMAIL PROTECTED]
>
> -
>
>


-- 
Cesar D. Rodas
http://www.cesarodas.com/
Mobile Phone: 595 961 974165
Phone: 595 21 645590
[EMAIL PROTECTED]
[EMAIL PROTECTED]


Re: [sqlite] FTS2 suggestion

2007-08-23 Thread Russell Leighton


Could fts3 (the next fts) have the option to override the default  
'match' function with one passed in (similar to the tokenizer)?


The reason I ask is then the fts table could be used as smart index  
when the tokenizer is
something like bigram, trigram, etc. and the 'match' function computes  
a similarity metric

and returns the row if above a threshold.

Postgres does this when you declare an index of type trigram, see:

http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm

Since SQLite does not allow 'plug-in' indexes, the idea would be to  
create an fts3 table with a key back to the main table and the string  
column you want index.

Indexing becomes a join through the fts3 table.

You would probably want to allow the user to pass args to the 'match'  
function so a threshold could be set to non-default values and maybe  
tweak matching options

specific to the match and tokenization.

Thoughts?


On Aug 23, 2007, at 4:56 PM, Scott Hess wrote:


On 8/20/07, Cesar D. Rodas <[EMAIL PROTECTED]> wrote:
As I know ( I can be wrong ) SQLite Full Text Search is only match  
with hole

words right? It could not be
And also no FT extension to db ( as far I know) is miss spell  
tolerant,


Yes, fts is matching exactly.  There is some primitive support for
English stemming using the Porter stemmer, but, honestly, it's not
well-exercised.


And
I've found this Paper that talks about *Using Superimposed Coding Of  
N-Gram

Lists For Efficient Inexact Matching*


http://citeseer.ist.psu.edu/cache/papers/cs/22812/http: 
zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/ 
william92using.pdf


I was reading and it is not so hard to implement, but it cost a extra
storage space, but I think the benefits are more.

Also following this paper could be done a way to match with fragments  
of

words... what do you think of it?


It's an interesting paper, and I must say that anything which involves
Bloom Filters automatically draws my attention :-).

While I think spelling-suggestion might be valuable for fts in the
longer term, I'm not very enthusiastic about this particular model.
It seems much more useful in the standard indexing model of building
the index, manually tweaking it, and then doing a ton of queries
against it.  fts is really fairly constrained, because many use-cases
are more along the lines of update the index quite a bit, and query it
only a few times.

Also, I think the concepts in the paper might have very significant
problems handling Unicode, because the bit vectors will get so very
large.  I may be wrong, sometimes the overlapping-vector approach can
have surprising relevance depending on the frequency distribution of
the things in the vector.  It would need some experimentation to
figure that out.

Certainly something to bookmark, though.

Thanks,
scott

--- 
--

To unsubscribe, send email to [EMAIL PROTECTED]
--- 
--





-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS2 suggestion

2007-08-23 Thread Scott Hess
It's all interesting, but categorization is hard.  Not so hard to get
some results, sort of hard to get quality results.  Might work as a
nice adjunct to fts, so that you can throw the search terms into the
categorization engine and put up suggestions for re-running the search
with a tighter focus.

-scott


On 8/23/07, Cesar D. Rodas <[EMAIL PROTECTED]> wrote:
> On 23/08/07, Scott Hess <[EMAIL PROTECTED]> wrote:
> > On 8/20/07, Cesar D. Rodas <[EMAIL PROTECTED]> wrote:
> > > As I know ( I can be wrong ) SQLite Full Text Search is only match with 
> > > hole
> > > words right? It could not be
> > > And also no FT extension to db ( as far I know) is miss spell tolerant,
> >
> > Yes, fts is matching exactly.  There is some primitive support for
> > English stemming using the Porter stemmer, but, honestly, it's not
> > well-exercised.
> >
> > > And
> > > I've found this Paper that talks about *Using Superimposed Coding Of 
> > > N-Gram
> > > Lists For Efficient Inexact Matching*
> >
> > http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> > >
> > > I was reading and it is not so hard to implement, but it cost a extra
> > > storage space, but I think the benefits are more.
> > >
> > > Also following this paper could be done a way to match with fragments of
> > > words... what do you think of it?
> >
> > It's an interesting paper, and I must say that anything which involves
> > Bloom Filters automatically draws my attention :-).
>
> Yeah. I am doing some investigations about that, I love that too. And
> I was watching that with n-grams you get a filter to stop common
> words, and could be used as a stemming-like algorithm but independent
> from the language.
>
> I was thinking to implement this
> http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
> when I finish up some things. What do you think of it?
>
> > While I think spelling-suggestion might be valuable for fts in the
> > longer term, I'm not very enthusiastic about this particular model.
> > It seems much more useful in the standard indexing model of building
> > the index, manually tweaking it, and then doing a ton of queries
> > against it.  fts is really fairly constrained, because many use-cases
> > are more along the lines of update the index quite a bit, and query it
> > only a few times.
> >
> > Also, I think the concepts in the paper might have very significant
> > problems handling Unicode, because the bit vectors will get so very
> > large.  I may be wrong, sometimes the overlapping-vector approach can
> > have surprising relevance depending on the frequency distribution of
> > the things in the vector.  It would need some experimentation to
> > figure that out.
> >
> > Certainly something to bookmark, though.
> >
> > Thanks,
> > scott
> >
> > -
> > To unsubscribe, send email to [EMAIL PROTECTED]
> > -
> >
> >
>
>
>
> --
> Cesar D. Rodas
> http://www.cesarodas.com/
> Mobile Phone: 595 961 974165
> Phone: 595 21 645590
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
>
> -
> To unsubscribe, send email to [EMAIL PROTECTED]
> -
>
>

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS2 suggestion

2007-08-23 Thread Cesar D. Rodas
On 23/08/07, Scott Hess <[EMAIL PROTECTED]> wrote:
> On 8/20/07, Cesar D. Rodas <[EMAIL PROTECTED]> wrote:
> > As I know ( I can be wrong ) SQLite Full Text Search is only match with hole
> > words right? It could not be
> > And also no FT extension to db ( as far I know) is miss spell tolerant,
>
> Yes, fts is matching exactly.  There is some primitive support for
> English stemming using the Porter stemmer, but, honestly, it's not
> well-exercised.
>
> > And
> > I've found this Paper that talks about *Using Superimposed Coding Of N-Gram
> > Lists For Efficient Inexact Matching*
>
> http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
> >
> > I was reading and it is not so hard to implement, but it cost a extra
> > storage space, but I think the benefits are more.
> >
> > Also following this paper could be done a way to match with fragments of
> > words... what do you think of it?
>
> It's an interesting paper, and I must say that anything which involves
> Bloom Filters automatically draws my attention :-).

Yeah. I am doing some investigations about that, I love that too. And
I was watching that with n-grams you get a filter to stop common
words, and could be used as a stemming-like algorithm but independent
from the language.

I was thinking to implement this
http://www.mail-archive.com/sqlite-users%40sqlite.org/msg26923.html
when I finish up some things. What do you think of it?

> While I think spelling-suggestion might be valuable for fts in the
> longer term, I'm not very enthusiastic about this particular model.
> It seems much more useful in the standard indexing model of building
> the index, manually tweaking it, and then doing a ton of queries
> against it.  fts is really fairly constrained, because many use-cases
> are more along the lines of update the index quite a bit, and query it
> only a few times.
>
> Also, I think the concepts in the paper might have very significant
> problems handling Unicode, because the bit vectors will get so very
> large.  I may be wrong, sometimes the overlapping-vector approach can
> have surprising relevance depending on the frequency distribution of
> the things in the vector.  It would need some experimentation to
> figure that out.
>
> Certainly something to bookmark, though.
>
> Thanks,
> scott
>
> -
> To unsubscribe, send email to [EMAIL PROTECTED]
> -
>
>



-- 
Cesar D. Rodas
http://www.cesarodas.com/
Mobile Phone: 595 961 974165
Phone: 595 21 645590
[EMAIL PROTECTED]
[EMAIL PROTECTED]

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS2 suggestion

2007-08-23 Thread Scott Hess
On 8/20/07, Cesar D. Rodas <[EMAIL PROTECTED]> wrote:
> As I know ( I can be wrong ) SQLite Full Text Search is only match with hole
> words right? It could not be
> And also no FT extension to db ( as far I know) is miss spell tolerant,

Yes, fts is matching exactly.  There is some primitive support for
English stemming using the Porter stemmer, but, honestly, it's not
well-exercised.

> And
> I've found this Paper that talks about *Using Superimposed Coding Of N-Gram
> Lists For Efficient Inexact Matching*

http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf
>
> I was reading and it is not so hard to implement, but it cost a extra
> storage space, but I think the benefits are more.
>
> Also following this paper could be done a way to match with fragments of
> words... what do you think of it?

It's an interesting paper, and I must say that anything which involves
Bloom Filters automatically draws my attention :-).

While I think spelling-suggestion might be valuable for fts in the
longer term, I'm not very enthusiastic about this particular model.
It seems much more useful in the standard indexing model of building
the index, manually tweaking it, and then doing a ton of queries
against it.  fts is really fairly constrained, because many use-cases
are more along the lines of update the index quite a bit, and query it
only a few times.

Also, I think the concepts in the paper might have very significant
problems handling Unicode, because the bit vectors will get so very
large.  I may be wrong, sometimes the overlapping-vector approach can
have surprising relevance depending on the frequency distribution of
the things in the vector.  It would need some experimentation to
figure that out.

Certainly something to bookmark, though.

Thanks,
scott

-
To unsubscribe, send email to [EMAIL PROTECTED]
-