Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2011-06-01 Thread Fedor Tyurin
Were you able to solve the problem? What solution have to chosen?

BR,
Fedor

I'd appreciate any suggestions on good ways to do this, I'm neither an SQL
or
sqlite expert, so I might be thinking about it all wrong.
I have something like a (read-only) address book/rolodex, with interactive
searching. As users type into the search box, I need to first know for each
section how many rows match the substring typed so far. I only display the
rows that are visible on screen.
I have two queries:
(A) I count the rows in a letter group.
If they typed "e":
select substr(name,1,1), count(*) from my_table where name like '%e%'
group by substr(name,1,1);
A|94
B|118
C|131
...
This is too slow, ~3sec, with 2500 rows, and we want to have 1 rows.
Worse, when they type "es", the search is as slow after they type "s" as
when
they typed "e", even though the "es" rows are a sub-set of the rows that
matched "e".
FTS3 only searches full terms/words by default, but I think if I built a
custom
tokenizer that returned all the suffix trees for a name:
"fu bar" => [ "r", "ar", "bar", " bar", "u bar", "fu bar"]
That I could do rewrite query (A) like this:
select substr(name,1,1), count(*) from my_table where name match 'e*'
group by substr(name,1,1);
Is this a reasonable approach? Is there a better way? Has somebody
else done this?


(B) I access specific rows within a letter group.
For visible rows, I fetch them by offset into a letter group, so row 4 in
the
"g" section of names containing "e" would be:
select * from my_table where name like "g%" and name like "%e%" order
by name limit 1 offset 4;
The performance for this is OK, right now, I think it's because the first
LIKE
can use the index, so the linear scan is over only a few hundred rows. Or it
could be that the on-screen display of each row is slower than the DB
search. I
think it might become a problem, though.
I'm not sure how I would rewrite this to use FTS3 if it turns out to be to
slow
for a larger DB, maybe a tokenizer that puts the first letter of the name as
the first letter of every suffix?
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-09 Thread Black, Michael (IS)
Sounds to me like Boyer-Moore is needed
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
 
And...I would probably pre-load the database table into 26 seperate memory 
tables to avoid any SQL interactivity at all other than the initial loading.  
Adding the SQL layer slows things down far too much.
 
Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems
 



From: sqlite-users-boun...@sqlite.org on behalf of Tim Romano
Sent: Mon 8/9/2010 7:00 AM
To: General Discussion of SQLite Database
Subject: EXTERNAL:Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 
with suffix-tree tokenizer be the fast way?



First, permit me a little rant. As a user, I dislike this kind of
incremental search feature if there's no easy way to toggle it or to
configure it and the list of items will be large enough to cause a typing
lag. The feature can become an intrusive nuisance, the opposite of what is
intended.  Browsers put this feature on the URL address bar and Google has
it on its search-input. Keystrokes entered often get swallowed up. It's
worse than typing on a 300 baud dumb terminal, for at least on those ancient
machines your characters would eventually be displayed on the green screen,
whereas with today's browsers the characters often just get eaten; I find
myself having to retype the first few characters of a URL or search term far
too often.

I agree with Radzi's suggestion. Once you have the initial set of of hits
(rowid, name)  in an array, do the rest in procedurally rather than going
back against the database with a new SQL query and a longer search string.
That will be much faster that issuing a new SQL query after every keystroke.
 I would wait until the user had typed at least two characters before
kicking off the initial search because finding every value that contains a
common letter is not helpful when the list of matches is a very long one.

Regards
Tim Romano
Swarthmore PA




On Fri, Aug 6, 2010 at 9:54 PM, Scott Hess <sh...@google.com> wrote:

> On Fri, Aug 6, 2010 at 6:08 PM, Sam Roberts <vieuxt...@gmail.com> wrote:
> > On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess <sh...@google.com> wrote:
> >> On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts <vieuxt...@gmail.com>
> wrote:
> >>> FTS3 only searches full terms/words by default, but I think if I built
> a custom
> >>> tokenizer that returned all the suffix trees for a name:
> >>
> >> FTS3 can do prefix searches, MATCH 'a*'.  Also, it aimed to support
> >
> > Prefix searches don't allow matching in the middle of words. For
> > example, I want  "bert"
> > to match my name, "roberts".
>
> Darn.  Sorry, was only thinking with half my brain, and that half
> connected your problem up with some past idea.  You're right, you'd
> need the tidbits to get at the interior substrings.
>
> That said, you should be able to pretty easily copy the current
> tokenizer and modify it to return multiple tokens at a single
> location.
>
> -scott
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-09 Thread Tim Romano
First, permit me a little rant. As a user, I dislike this kind of
incremental search feature if there's no easy way to toggle it or to
configure it and the list of items will be large enough to cause a typing
lag. The feature can become an intrusive nuisance, the opposite of what is
intended.  Browsers put this feature on the URL address bar and Google has
it on its search-input. Keystrokes entered often get swallowed up. It's
worse than typing on a 300 baud dumb terminal, for at least on those ancient
machines your characters would eventually be displayed on the green screen,
whereas with today's browsers the characters often just get eaten; I find
myself having to retype the first few characters of a URL or search term far
too often.

I agree with Radzi's suggestion. Once you have the initial set of of hits
(rowid, name)  in an array, do the rest in procedurally rather than going
back against the database with a new SQL query and a longer search string.
That will be much faster that issuing a new SQL query after every keystroke.
 I would wait until the user had typed at least two characters before
kicking off the initial search because finding every value that contains a
common letter is not helpful when the list of matches is a very long one.

Regards
Tim Romano
Swarthmore PA




On Fri, Aug 6, 2010 at 9:54 PM, Scott Hess  wrote:

> On Fri, Aug 6, 2010 at 6:08 PM, Sam Roberts  wrote:
> > On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess  wrote:
> >> On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts 
> wrote:
> >>> FTS3 only searches full terms/words by default, but I think if I built
> a custom
> >>> tokenizer that returned all the suffix trees for a name:
> >>
> >> FTS3 can do prefix searches, MATCH 'a*'.  Also, it aimed to support
> >
> > Prefix searches don't allow matching in the middle of words. For
> > example, I want  "bert"
> > to match my name, "roberts".
>
> Darn.  Sorry, was only thinking with half my brain, and that half
> connected your problem up with some past idea.  You're right, you'd
> need the tidbits to get at the interior substrings.
>
> That said, you should be able to pretty easily copy the current
> tokenizer and modify it to return multiple tokens at a single
> location.
>
> -scott
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-06 Thread Scott Hess
On Fri, Aug 6, 2010 at 6:08 PM, Sam Roberts  wrote:
> On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess  wrote:
>> On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts  wrote:
>>> FTS3 only searches full terms/words by default, but I think if I built a 
>>> custom
>>> tokenizer that returned all the suffix trees for a name:
>>
>> FTS3 can do prefix searches, MATCH 'a*'.  Also, it aimed to support
>
> Prefix searches don't allow matching in the middle of words. For
> example, I want  "bert"
> to match my name, "roberts".

Darn.  Sorry, was only thinking with half my brain, and that half
connected your problem up with some past idea.  You're right, you'd
need the tidbits to get at the interior substrings.

That said, you should be able to pretty easily copy the current
tokenizer and modify it to return multiple tokens at a single
location.

-scott
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-06 Thread Mohd Radzi Ibrahim
Have you not consider loading the whole rows into memory array and use simple 
string search or regexp? I'm sure 10,000 records could be search a blink.

best regards,
Radzi.
On 6-Aug-2010, at 3:42 AM, Sam Roberts wrote:

> I'd appreciate any suggestions on good ways to do this, I'm neither an SQL or
> sqlite expert, so I might be thinking about it all wrong.
> 
> I have something like a (read-only) address book/rolodex, with interactive
> searching. As users type into the search box, I need to first know for each
> section how many rows match the substring typed so far.  I only display the
> rows that are visible on screen.
> 
> I have two queries:
> 
> (A) I count the rows in a letter group.
> 
> If they typed "e":
> 
> select substr(name,1,1), count(*) from my_table where name like '%e%'
> group by substr(name,1,1);
> A|94
> B|118
> C|131
> ...
> 
> This is too slow, ~3sec, with 2500 rows, and we want to have 1 rows.
> 
> Worse, when they type "es", the search is as slow after they type "s" as when
> they typed "e", even though the "es" rows are a sub-set of the rows that
> matched "e".
> 
> FTS3 only searches full terms/words by default, but I think if I built a 
> custom
> tokenizer that returned all the suffix trees for a name:
> 
> "fu bar" => [ "r", "ar", "bar", " bar", "u bar", "fu bar"]
> 
> That I could do rewrite query (A) like this:
> 
> select substr(name,1,1), count(*) from my_table where name match 'e*'
> group by substr(name,1,1);
> 
> Is this a reasonable approach? Is there a better way? Has somebody
> else done this?
> 
> 
> 
> (B) I access specific rows within a letter group.
> 
> For visible rows, I fetch them by offset into a letter group, so row 4 in the
> "g" section of names containing "e" would be:
> 
> select * from my_table where name like "g%" and name like "%e%" order
> by name limit 1 offset 4;
> 
> The performance for this is OK, right now, I think it's because the first LIKE
> can use the index, so the linear scan is over only a few hundred rows. Or it
> could be that the on-screen display of each row is slower than the DB search. 
> I
> think it might become a problem, though.
> 
> I'm not sure how I would rewrite this to use FTS3 if it turns out to be to 
> slow
> for a larger DB, maybe a tokenizer that puts the first letter of  the name as
> the first letter of every suffix?
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-06 Thread Sam Roberts
On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess  wrote:
> On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts  wrote:
>> FTS3 only searches full terms/words by default, but I think if I built a 
>> custom
>> tokenizer that returned all the suffix trees for a name:
>
> FTS3 can do prefix searches, MATCH 'a*'.  Also, it aimed to support

Prefix searches don't allow matching in the middle of words. For
example, I want  "bert"
to match my name, "roberts".

So, I think I'd need to tokenize roberts as "s", "ts", ..., "berts",
"oberts", ... etc.

Then do a prefix match for "bert*" in order to see that "roberts" matches.

Lucky, I don't need or care about any of the snippeting stuff, because
I'm matching short strings (names).

Cheers,
Sam
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-06 Thread Scott Hess
On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts  wrote:
> FTS3 only searches full terms/words by default, but I think if I built a 
> custom
> tokenizer that returned all the suffix trees for a name:

FTS3 can do prefix searches, MATCH 'a*'.  Also, it aimed to support
multiple hits at the same position, for stemming purposes.  So you
might be able to get away with making a copy of fts3_tokenizer1.c, and
modifying it to keep an additional flag in the cursor to let you
return each token twice (once reversed).

I can't offhand think of how to distinguish the resulting prefix
matches from suffix matches.  Maybe you can work that out yourself by
using the rows returned to figure it out.  Also note that this will
possibly interact poorly with the snippeting and offset functions.

As a short-term proof-of-concept hack, you could just have two tables.
 Insert your originals into one table, then take last_insert_rowid()
and insert the document reversed into the other table.

-scott
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-06 Thread Sam Roberts
On Fri, Aug 6, 2010 at 6:11 AM, Adam DeVita  wrote:
> A variant on Simon's plan.
> Are the 10,000 rows static, slowly changing, or frequently changing?

Never change, it's read-only.

>  Does
> it make sense to pre-calculate some counts at the time data is loaded?
>  Is
> this memory constrained so much that you can't afford 1 or 2 MB to let you
> look up based on ints? (I'm assuming that one letter is all you are after,
> either 'starts with' or 'contains' and not in order combinations.)

No, substrings, it's just that I then need a count of matching
substrings by first char.

Good idea, there are a number of other queries where pre-calculating
is linear in the space cost, but here the the usage is interactive
search, where as they type more of the name, it narrows down the
search results as people type in more.

Pre-calculating would be about 40 factorial in space, there are about
64000 3-character strings, and then once  they typed the 4th char in
it would be slow again. Of course, not all of those exist. Hm. Maybe
I'll try to precalculate the suffix tree, and see how many results
there really are, I don't need to store zero results.

The fastest I've found so far is using FTS3. Its a little slow, but
not unusably so. There are only 2500 rows now, I hope that it will
scale well as the DB increases in size. I'm still considering other
approaches, maybe a custom b-tree.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-06 Thread Adam DeVita
A variant on Simon's plan.
Are the 10,000 rows static, slowly changing, or frequently changing?   Does
it make sense to pre-calculate some counts at the time data is loaded?  Is
this memory constrained so much that you can't afford 1 or 2 MB to let you
look up based on ints? (I'm assuming that one letter is all you are after,
either 'starts with' or 'contains' and not in order combinations.)

Adam

On Thu, Aug 5, 2010 at 5:40 PM, Simon Slavin  wrote:

>
> On 5 Aug 2010, at 10:03pm, Sam Roberts wrote:
>
> > But do you think the section would make the counting faster? I think
> > I'd have to get the row counts like this, which would still do the
> > slow full table scan:
> >
> >  select section, count(*) from my_table where name like '%e%' group by
> section;
>
> But 'group by section' can profit from the index on the section column so
> it should be faster.
>
> As with all these things, the suggestion is to try it and see.  You should
> try six or seven different solutions including shuffling columns and indexes
> before you settle on the one that will be in your final code.
>
> Simon.
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
VerifEye Technologies Inc.
905-948-0015x245
7100 Warden Ave, Unit 3
Markham ON, L3R 8B5
Canada
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-05 Thread Simon Slavin

On 5 Aug 2010, at 10:03pm, Sam Roberts wrote:

> But do you think the section would make the counting faster? I think
> I'd have to get the row counts like this, which would still do the
> slow full table scan:
> 
>  select section, count(*) from my_table where name like '%e%' group by 
> section;

But 'group by section' can profit from the index on the section column so it 
should be faster.

As with all these things, the suggestion is to try it and see.  You should try 
six or seven different solutions including shuffling columns and indexes before 
you settle on the one that will be in your final code.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-05 Thread Sam Roberts
On Thu, Aug 5, 2010 at 1:37 PM, Simon Slavin  wrote:
>
> On 5 Aug 2010, at 8:42pm, Sam Roberts wrote:
>
>> select substr(name,1,1), count(*) from my_table where name like '%e%'
>> group by substr(name,1,1);
>
> If you are constantly going to be using the first character of the name like 
> that, give it a column of its own with its own index.

That's a good idea. I think it would help a lot with row fetching if
section was it's own column:

  select * from my_table where section is "g" and name like "%e%"
order by name limit 1 offset 4;

But do you think the section would make the counting faster? I think
I'd have to get the row counts like this, which would still do the
slow full table scan:

  select section, count(*) from my_table where name like '%e%' group by section;

Cheers,
Sam
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-05 Thread Simon Slavin

On 5 Aug 2010, at 8:42pm, Sam Roberts wrote:

> select substr(name,1,1), count(*) from my_table where name like '%e%'
> group by substr(name,1,1);

If you are constantly going to be using the first character of the name like 
that, give it a column of its own with its own index.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?

2010-08-05 Thread Sam Roberts
I'd appreciate any suggestions on good ways to do this, I'm neither an SQL or
sqlite expert, so I might be thinking about it all wrong.

I have something like a (read-only) address book/rolodex, with interactive
searching. As users type into the search box, I need to first know for each
section how many rows match the substring typed so far.  I only display the
rows that are visible on screen.

I have two queries:

(A) I count the rows in a letter group.

If they typed "e":

select substr(name,1,1), count(*) from my_table where name like '%e%'
group by substr(name,1,1);
A|94
B|118
C|131
...

This is too slow, ~3sec, with 2500 rows, and we want to have 1 rows.

Worse, when they type "es", the search is as slow after they type "s" as when
they typed "e", even though the "es" rows are a sub-set of the rows that
matched "e".

FTS3 only searches full terms/words by default, but I think if I built a custom
tokenizer that returned all the suffix trees for a name:

"fu bar" => [ "r", "ar", "bar", " bar", "u bar", "fu bar"]

That I could do rewrite query (A) like this:

select substr(name,1,1), count(*) from my_table where name match 'e*'
group by substr(name,1,1);

Is this a reasonable approach? Is there a better way? Has somebody
else done this?



(B) I access specific rows within a letter group.

For visible rows, I fetch them by offset into a letter group, so row 4 in the
"g" section of names containing "e" would be:

select * from my_table where name like "g%" and name like "%e%" order
by name limit 1 offset 4;

The performance for this is OK, right now, I think it's because the first LIKE
can use the index, so the linear scan is over only a few hundred rows. Or it
could be that the on-screen display of each row is slower than the DB search. I
think it might become a problem, though.

I'm not sure how I would rewrite this to use FTS3 if it turns out to be to slow
for a larger DB, maybe a tokenizer that puts the first letter of  the name as
the first letter of every suffix?
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users