Re: [PATCHES] strpos() && KMP

2007-09-26 Thread Bruce Momjian

Added to TODO:

> * Implement Boyer-Moore searching in strpos()
>
>   http://archives.postgresql.org/pgsql-patches/2007-08/msg00012.php


---

Pavel Ajtkulov wrote:
> Hello,
> 
> this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function 
> (see Cormen et al. Introduction to Algorithms, MIT Press, 2001).
> 
> It also works with multibyte wchar.
> 
> In worst case current brute force strpos() takes O(n * m) (n && m is length 
> of strings) 
> time (example: 'aaa...aaab' search in 'aaa...aaa').
> KMP algo always takes O(n + m) time. 
> To check this someone need to create a table with one text attribute, and 
> insert several thousands
> record 'aa..aa'(for example, with lenght = 1000) . After execute "select 
> count(*) from test where 
> strpos(a, 'aaaaab') > 0;" on current and modified version.
> 
> Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead 
> "select .. where attr like '%word%'"
> (strpos must be faster than regex).
> 
> In general, this belongs to artificial expressions. In natural language KMP 
> is equal (execution time)
> current strpos() nearly.
> 
> 
> 
> Ajtkulov Pavel
> [EMAIL PROTECTED]
> 
> P. S. Sorry for prime English.

[ Attachment, skipping... ]

> 
> ---(end of broadcast)---
> TIP 5: don't forget to increase your free space map settings

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] strpos() && KMP

2007-08-10 Thread Tom Lane
Pavel Ajtkulov <[EMAIL PROTECTED]> writes:
> Tom Lane writes:
>> Moreover, you'd lose the guarantee of not-worse-than-linear time,
>> because hash lookup can be pathologically bad if you get a lot of hash
>> collisions.

> compute max_wchar, min_wchar. If (d = max_wchar - min_wchar) < k (for
> example, k = 1000), then we use index table (wchar -> wchar -
> min_wchar). Else we use hash table. Number of collisions would be a
> few (because hash table needs for pattern characters only.

I think you missed my point: there's a significant difference between
"guaranteed good performance" and "probabilistically good performance".
Even when the probably-good algorithm wins for typical cases, there's a
strong argument to be made for guarantees.  The problem you set out to
solve really is that an algorithm that's all right in everyday cases
will suck in certain uncommon cases --- so why do you want to fix it
by just moving around the cases in which it fails to do well?

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] strpos() && KMP

2007-08-10 Thread Pavel Ajtkulov
Tom Lane writes:

>> hash table?

> I'd think the cost of hashing would render it impractical.  Most of the
> papers I've seen on this topic worry about getting single instructions
> out of the search loop --- a hash lookup will cost lots more than that.
> Moreover, you'd lose the guarantee of not-worse-than-linear time,
> because hash lookup can be pathologically bad if you get a lot of hash
> collisions.

compute max_wchar, min_wchar. If (d = max_wchar - min_wchar) < k (for
example, k = 1000), then we use index table (wchar -> wchar -
min_wchar). Else we use hash table. Number of collisions would be a
few (because hash table needs for pattern characters only. Characters
located serially, hash function = whchar % const).

>> The main difficulty with BM is verification and understanding "good
>> suffix shift" (the second part of BM) (I don't understand it entirely).

> Yeah, there seem to be a bunch of variants of BM (many of them not
> guaranteed linear, which I'm sure we don't want) and the earliest
> papers had bugs.  But a good implementation would be a lot easier
> sell because it would show benefits for a much wider set of use-cases
> than KMP.

Is there requirement for some string mathching algorithms/data
structure(suffix array/tree) in PG? or "We've
had no complaints about the speed of those functions".



Ajtkulov Pavel
[EMAIL PROTECTED]



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] strpos() && KMP

2007-08-08 Thread Tom Lane
Pavel Ajtkulov <[EMAIL PROTECTED]> writes:
> Tom Lane writes:
>> The difficulty with B-M is the need for a table indexed by character
>> code, which at first glance looks impractical for wchars.  But it seems
>> to me that we could use "wchar % 256" as the table index, meaning that
>> wchars with the same trailing byte share the same table entry.

> hash table?

I'd think the cost of hashing would render it impractical.  Most of the
papers I've seen on this topic worry about getting single instructions
out of the search loop --- a hash lookup will cost lots more than that.
Moreover, you'd lose the guarantee of not-worse-than-linear time,
because hash lookup can be pathologically bad if you get a lot of hash
collisions.

> The main difficulty with BM is verification and understanding "good
> suffix shift" (the second part of BM) (I don't understand it entirely).

Yeah, there seem to be a bunch of variants of BM (many of them not
guaranteed linear, which I'm sure we don't want) and the earliest
papers had bugs.  But a good implementation would be a lot easier
sell because it would show benefits for a much wider set of use-cases
than KMP.

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PATCHES] strpos() && KMP

2007-08-08 Thread Pavel Ajtkulov
Tom Lane writes:

> I wonder why you didn't propose Boyer-Moore instead, as that would have
> some advantage for natural language text as well.

> The difficulty with B-M is the need for a table indexed by character
> code, which at first glance looks impractical for wchars.  But it seems
> to me that we could use "wchar % 256" as the table index, meaning that
> wchars with the same trailing byte share the same table entry.  That
> would lose some efficiency compared to an exact implementation, but the
> limited table size would outweigh that except in the most pathological
> cases.

hash table?

The main difficulty with BM is verification and understanding "good
suffix shift" (the second part of BM) (I don't understand it entirely).


Ajtkulov Pavel
[EMAIL PROTECTED]



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] strpos() && KMP

2007-08-07 Thread Tom Lane
Pavel Ajtkulov <[EMAIL PROTECTED]> writes:
> Patch intends for artificial language (for example DNA, or
> language with small alphabet, or regular language) only.
> In natural language, KMP(and other search algo) haven't notable
> advantages (+-5% time execution).

I wonder why you didn't propose Boyer-Moore instead, as that would have
some advantage for natural language text as well.

The difficulty with B-M is the need for a table indexed by character
code, which at first glance looks impractical for wchars.  But it seems
to me that we could use "wchar % 256" as the table index, meaning that
wchars with the same trailing byte share the same table entry.  That
would lose some efficiency compared to an exact implementation, but the
limited table size would outweigh that except in the most pathological
cases.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] strpos() && KMP

2007-08-07 Thread Decibel!
On Wed, Aug 08, 2007 at 12:19:53AM +0500, Pavel Ajtkulov wrote:
> 
> > Do you have any performance test results for this?
> 
> I describe the worst case in first message (search 'aaa..aab' in
> 'aa..aa', "complete" N^2). It works some msec instead of several sec
> (in current version).

You describe the test case but don't provide any results. Providing the
test case so others can duplicate your results is good, but you should
provide actual results as well. You should also make sure to test any
cases where performance would actually degrade so we can see what that
looks like (though I don't know if that's an issue in this case).
-- 
Decibel!, aka Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)


pgpGxMofKAc20.pgp
Description: PGP signature


Re: [PATCHES] strpos() && KMP

2007-08-07 Thread Pavel Ajtkulov

> Do you have any performance test results for this?

I describe the worst case in first message (search 'aaa..aab' in
'aa..aa', "complete" N^2). It works some msec instead of several sec
(in current version).

Patch intends for artificial language (for example DNA, or
language with small alphabet, or regular language) only.

In natural language, KMP(and other search algo) haven't notable
advantages (+-5% time execution).


Ajtkulov Pavel
[EMAIL PROTECTED]



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PATCHES] strpos() && KMP

2007-08-03 Thread Decibel!
On Thu, Aug 02, 2007 at 01:18:22AM +0500, Pavel Ajtkulov wrote:
> Hello,
> 
> this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function 
> (see Cormen et al. Introduction to Algorithms, MIT Press, 2001).
> 
> It also works with multibyte wchar.
> 
> In worst case current brute force strpos() takes O(n * m) (n && m is length 
> of strings) 
> time (example: 'aaa...aaab' search in 'aaa...aaa').
> KMP algo always takes O(n + m) time. 
> To check this someone need to create a table with one text attribute, and 
> insert several thousands
> record 'aa..aa'(for example, with lenght = 1000) . After execute "select 
> count(*) from test where 
> strpos(a, 'aaaaab') > 0;" on current and modified version.
> 
> Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead 
> "select .. where attr like '%word%'"
> (strpos must be faster than regex).
> 
> In general, this belongs to artificial expressions. In natural language KMP 
> is equal (execution time)
> current strpos() nearly.

Do you have any performance test results for this?
-- 
Decibel!, aka Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)


pgprzhuOBAr5D.pgp
Description: PGP signature


Re: [PATCHES] strpos() && KMP

2007-08-01 Thread Tom Lane
Pavel Ajtkulov <[EMAIL PROTECTED]> writes:
> this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function 
> (see Cormen et al. Introduction to Algorithms, MIT Press, 2001).

This seems like a lot of complexity added to fix a non-problem.  We've
had no complaints about the speed of those functions (at least not since
we changed the original O(N^2) method :-().

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] strpos() && KMP

2007-08-01 Thread Andrew Dunstan



Pavel Ajtkulov wrote:

Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead "select .. 
where attr like '%word%'"
(strpos must be faster than regex).

  


The LIKE code does not use the regex engine. See 
src/backend/utils/adt/like.c and like_match.c - which recently got an 
efficiency makeover, especially for MB charsets (and most especially for 
UTF8).


cheers

andrew

---(end of broadcast)---
TIP 6: explain analyze is your friend


[PATCHES] strpos() && KMP

2007-08-01 Thread Pavel Ajtkulov
Hello,

this patch allow to use Knuth-Morrison-Pratt algorithm for strpos() function 
(see Cormen et al. Introduction to Algorithms, MIT Press, 2001).

It also works with multibyte wchar.

In worst case current brute force strpos() takes O(n * m) (n && m is length of 
strings) 
time (example: 'aaa...aaab' search in 'aaa...aaa').
KMP algo always takes O(n + m) time. 
To check this someone need to create a table with one text attribute, and 
insert several thousands
record 'aa..aa'(for example, with lenght = 1000) . After execute "select 
count(*) from test where 
strpos(a, 'aaaaab') > 0;" on current and modified version.

Also, I advise to use "select .. where strpos(att, 'word') > 0;" instead 
"select .. where attr like '%word%'"
(strpos must be faster than regex).

In general, this belongs to artificial expressions. In natural language KMP is 
equal (execution time)
current strpos() nearly.



Ajtkulov Pavel
[EMAIL PROTECTED]

P. S. Sorry for prime English.


varlena.c.patch
Description: Binary data

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings