Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-12 Thread Tom Lane
Karan Sikka  writes:
>> Having said that, I've had a bee in my bonnet for a long time about
>> removing per-row setup cost for repetitive regex matches, and
>> whatever infrastructure that needs would work for this too.

> What are the per-row setup costs for regex matches?

Well, they're pretty darn high if you have more active regexps than will
fit in that cache, and even if you don't, the cache lookup seems a bit
inefficient.  What I'd really like to do is get rid of that cache in favor
of having a way to treat a precompiled regexp as a constant.  I think this
is probably possible via inventing a "regexp" datatype, which we make the
declared RHS input type for the ~ operator, and give it an implicit cast
from text so that existing queries don't break.  The compiled regexp tree
structure contains pointers so it could never go to disk, but now that
we have the "expanded datum" infrastructure you could imagine that the
on-disk representation is the same as text but we support adding a
compiled tree to it in-memory.

Or maybe we just need a smarter cache mechanism in regexp.c.  A cache
like that might be the only way to deal with a query using variable
patterns (e.g, pattern argument coming from a table column).  But it
seems like basically the wrong approach for the common case of a
constant pattern.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-12 Thread Karan Sikka
> Having said that, I've had a bee in my bonnet for a long time about
> removing per-row setup cost for repetitive regex matches, and
> whatever infrastructure that needs would work for this too.

What are the per-row setup costs for regex matches? I looked at
`regexp.c` and saw:

```
/*
 * We cache precompiled regular expressions using a "self organizing list"
 * structure, in which recently-used items tend to be near the front.
 * Whenever we use an entry, it's moved up to the front of the list.
 * Over time, an item's average position corresponds to its frequency of
use.
 *
```

What proverbial bee did you have in your bonnet about the current regex
implementation?
Which functions other than `strpos` and `LIKE` would benefit from a similar
cache, or perhaps a query-scoped cache?

In the mean time I'll look at other TODOs that catch my interest.
Feel free to point me in the direction of one that you think is
both desirable and easy enough for a beginner.

Thanks!
Karan


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-02 Thread Tom Lane
Karan Sikka  writes:
> Just for the record, I'm leaning to the side of "not worth it". My
> thoughts are, if you are at a scale where you care about this 20%
> speedup, you would want to go all the way to an indexed structure,
> because the gains you would realize would exceed 20%, and 20% may not be
> a sufficient speedup anyway.

If I'm reading your test case correctly, 20% is actually a rather
impressive number, because it's 20% *overall* gain on queries that will
also involve TOAST fetch and decompress on the source data.  (Decompress
definitely, and I'm guessing those 5K strings don't compress well enough
to avoid getting pushed out-of-line; though it might be worth repeating
the test with chunks of 10K or 20K to be sure.)  So the percentage
improvement in the LIKE test proper must have been a lot more than that.

However, I'm dubious that LIKE patterns with long fixed substrings are a
common use-case, so I'm afraid that this might be quite a lot of work for
something that won't much benefit most users.  I'm also worried that the
setup costs might be enough to make it a net loss in many cases.

There are probably ways to amortize the setup costs, since typical
scenarios involve the same LIKE pattern across many rows, but implementing
that would add even more work.  (Having said that, I've had a bee in my
bonnet for a long time about removing per-row setup cost for repetitive
regex matches, and whatever infrastructure that needs would work for this
too.  And for strpos' B-M-H setup, looks like.  So this might be something
to look into with a suitably wide view of what the problem is.)

Not sure what advice to give you here.  I think this is in the grey zone
where it's hard to be sure whether it's worth putting work into.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-02 Thread Robert Haas
On Tue, Aug 2, 2016 at 1:56 PM, Alvaro Herrera  wrote:
>> > How desirable is this feature? To begin answering that question,
>> > I did some initial benchmarking with an English text corpus to find how 
>> > much
>> > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the 
>> > results
>> > were as follows: BMH can be up to 20% faster than LIKE, but for short
>> > substrings, it's roughly comparable or slower.
>> >
>> > Here are the results visualized:
>> >
>> > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
>> > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
>>
>> Based on these results, this looks to me like a pretty unexciting
>> thing upon which to spend time.
>
> Uh, a 20% different is "unexciting" for you?  I think it's interesting.
> Now, really, users shouldn't be running LIKE on constant strings all the
> time but rather use some sort of indexed search, but once in a while
> there is a need to run some custom query and you need to LIKE-scan a
> large portion of a table.  For those cases an algorithm that performs
> 20% better is surely welcome.

Sure, but an algorithm that performs 20% faster in the best case and
worse in some other cases is not the same thing as a 20%
across-the-board performance improvement.  I guess if we had a way of
deciding which algorithm to use in particular cases it might make
sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-02 Thread Karan Sikka
Yeah, you make a fair point.

I was hoping more people would chime in with strong opinions to make the
decision easier, but perhaps the lack of noise indicates this is a low-pri
feature.

I will ask around in my coworker circles to see what they think,
could you do the same?

Just for the record, I'm leaning to the side of "not worth it". My thoughts
are,
if you are at a scale where you care about this 20% speedup, you would want
 to go all the way to an indexed structure, because the gains you would
realize
would exceed 20%, and 20% may not be a sufficient speedup anyway.

On Tue, Aug 2, 2016 at 1:56 PM, Alvaro Herrera 
wrote:

> Robert Haas wrote:
> > On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka 
> wrote:
> > > Following the patch to implement strpos with Boyer-Moore-Horspool,
> > > it was proposed we bring BMH to LIKE as well.
> > >
> > > Original thread:
> > >
> https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us
> > >
> > > I'm a first time hacker and I found this task on the TODO list. It
> seemed
> > > interesting and isolated enough. But after looking at the code in
> > > like_match.c, it's clearly a non-trivial task.
> > >
> > > How desirable is this feature? To begin answering that question,
> > > I did some initial benchmarking with an English text corpus to find
> how much
> > > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the
> results
> > > were as follows: BMH can be up to 20% faster than LIKE, but for short
> > > substrings, it's roughly comparable or slower.
> > >
> > > Here are the results visualized:
> > >
> > > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
> > > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
> >
> > Based on these results, this looks to me like a pretty unexciting
> > thing upon which to spend time.
>
> Uh, a 20% different is "unexciting" for you?  I think it's interesting.
> Now, really, users shouldn't be running LIKE on constant strings all the
> time but rather use some sort of indexed search, but once in a while
> there is a need to run some custom query and you need to LIKE-scan a
> large portion of a table.  For those cases an algorithm that performs
> 20% better is surely welcome.
>
> I wouldn't be so quick to dismiss this.
>
> Of course, it needs to work in all cases, or failing that, be able to
> fall back to the original code if it cannot support some corner case.
>
> --
> Álvaro Herrerahttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-02 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka  wrote:
> > Following the patch to implement strpos with Boyer-Moore-Horspool,
> > it was proposed we bring BMH to LIKE as well.
> >
> > Original thread:
> > https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us
> >
> > I'm a first time hacker and I found this task on the TODO list. It seemed
> > interesting and isolated enough. But after looking at the code in
> > like_match.c, it's clearly a non-trivial task.
> >
> > How desirable is this feature? To begin answering that question,
> > I did some initial benchmarking with an English text corpus to find how much
> > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results
> > were as follows: BMH can be up to 20% faster than LIKE, but for short
> > substrings, it's roughly comparable or slower.
> >
> > Here are the results visualized:
> >
> > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
> > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
> 
> Based on these results, this looks to me like a pretty unexciting
> thing upon which to spend time.

Uh, a 20% different is "unexciting" for you?  I think it's interesting.
Now, really, users shouldn't be running LIKE on constant strings all the
time but rather use some sort of indexed search, but once in a while
there is a need to run some custom query and you need to LIKE-scan a
large portion of a table.  For those cases an algorithm that performs
20% better is surely welcome.

I wouldn't be so quick to dismiss this.

Of course, it needs to work in all cases, or failing that, be able to
fall back to the original code if it cannot support some corner case.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-01 Thread Robert Haas
On Mon, Aug 1, 2016 at 7:47 PM, Karan Sikka  wrote:
> I agree, should we remove it from the TODO list?

If nobody objects, sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-01 Thread Karan Sikka
I agree, should we remove it from the TODO list?

On Mon, Aug 1, 2016 at 6:13 PM, Robert Haas  wrote:

> On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka  wrote:
> > Following the patch to implement strpos with Boyer-Moore-Horspool,
> > it was proposed we bring BMH to LIKE as well.
> >
> > Original thread:
> >
> https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us
> >
> > I'm a first time hacker and I found this task on the TODO list. It seemed
> > interesting and isolated enough. But after looking at the code in
> > like_match.c, it's clearly a non-trivial task.
> >
> > How desirable is this feature? To begin answering that question,
> > I did some initial benchmarking with an English text corpus to find how
> much
> > faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the
> results
> > were as follows: BMH can be up to 20% faster than LIKE, but for short
> > substrings, it's roughly comparable or slower.
> >
> > Here are the results visualized:
> >
> > http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
> > http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png
>
> Based on these results, this looks to me like a pretty unexciting
> thing upon which to spend time.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] TODO item: Implement Boyer-Moore searching in LIKE queries

2016-08-01 Thread Robert Haas
On Mon, Aug 1, 2016 at 1:19 PM, Karan Sikka  wrote:
> Following the patch to implement strpos with Boyer-Moore-Horspool,
> it was proposed we bring BMH to LIKE as well.
>
> Original thread:
> https://www.postgresql.org/message-id/flat/27645.1220635769%40sss.pgh.pa.us#27645.1220635...@sss.pgh.pa.us
>
> I'm a first time hacker and I found this task on the TODO list. It seemed
> interesting and isolated enough. But after looking at the code in
> like_match.c, it's clearly a non-trivial task.
>
> How desirable is this feature? To begin answering that question,
> I did some initial benchmarking with an English text corpus to find how much
> faster BMH (Boyer-Moore-Horspool) would be compared to LIKE, and the results
> were as follows: BMH can be up to 20% faster than LIKE, but for short
> substrings, it's roughly comparable or slower.
>
> Here are the results visualized:
>
> http://ctrl-c.club/~ksikka/pg/like-strpos-short-1469975400.png
> http://ctrl-c.club/~ksikka/pg/like-strpos-long-1469975400.png

Based on these results, this looks to me like a pretty unexciting
thing upon which to spend time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers