Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-30 Thread Tom Lane
I wrote:
> So what I'm thinking we ought to do is redefine things so that
> initGISTstate sets fn_mcxt to a context that has the same lifespan as
> the GISTSTATE itself does.  We could possibly eliminate a few retail
> pfree's in the process, eg by keeping the GISTSTATE itself in that same
> context.
> Having done that, what gtrgm_consistent is doing would be an officially
> supported (dare I suggest even documented?) thing instead of a kluge,
> and we could then adopt the same methodology in gtrgm_penalty.

I've committed patches along this line.  On my machine, the time needed
to do the test case Heikki proposed (build a gist_trgm_ops index on the
contents of /usr/share/dict/words) drops from around 29.9 seconds to
about 17.3.  makesign is now down in the noise according to oprofile,
so I see no further reason to pursue the patch that started this thread.

What's not in the noise is the memcmp() required to check if the cached
makesign result is still good --- as best I can tell, that's near 50%
of the remaining runtime.  I don't see any non-invasive way to avoid
that.  If we were willing to redo the GiST support function API, we
could fix it, but I'm not sure it's worth that.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)

To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-30 Thread Tom Lane
I wrote:
> Alexander Korotkov  writes:
>> Isn't it possible to cache the signature of newitem in gtrgm_penalty
>> like gtrgm_consistent does for the query?

> [ studies that code for awhile ... ]  Ick, what a kluge.

> The main problem with that code is that the cache data gets leaked at
> the conclusion of a scan.  Having just seen the consequences of leaking
> the "giststate", I think this is something we need to fix not emulate.

> I wonder whether it's worth having the GIST code create a special
> scan-lifespan (or insert-lifespan) memory context that could be used
> for cached data such as this?  It's already creating a couple of
> contexts for its own purposes, so one more might not be a big problem.
> We'd have to figure out a way to make that context available to GIST
> support functions, though, as well as something cleaner than fn_extra
> for them to keep pointers in.

I've been chewing on this for awhile and am now thinking that maybe what
gtrgm_consistent is doing isn't that unreasonable.  It's certainly
legitimate for it to use fn_extra to maintain state between calls.
The problem fundamentally is that when a function uses fn_extra to
reference data it keeps in the fn_mcxt context, there's an implicit
contract that fn_extra will survive for the same length of time that the
fn_mcxt context does.  (Otherwise there's no way for the function to
avoid leaking that data after it's been called for the last time using
that FmgrInfo.)  And GIST is violating that assumption: it resets
fn_extra when it does initGISTstate, but fn_mcxt gets set to
CurrentMemoryContext, which in the problematic cases is a query-lifespan
context that will still be around after the GIST indexscan is concluded.

So what I'm thinking we ought to do is redefine things so that
initGISTstate sets fn_mcxt to a context that has the same lifespan as
the GISTSTATE itself does.  We could possibly eliminate a few retail
pfree's in the process, eg by keeping the GISTSTATE itself in that same
context.

Having done that, what gtrgm_consistent is doing would be an officially
supported (dare I suggest even documented?) thing instead of a kluge,
and we could then adopt the same methodology in gtrgm_penalty.

Comments?

regards, tom lane



Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-29 Thread Tom Lane
Alexander Korotkov  writes:
> On Fri, Sep 30, 2011 at 1:08 AM, Heikki Linnakangas <
> heikki.linnakan...@enterprisedb.com> wrote:
>> At every call, gtrgm_penalty() has to calculate the signature for newitem,
>> using makesign(). That's an enormous waste of effort, but there's currently
>> no way for gtrgm_penalty() to avoid that. If we could call makesign() only on
>> the first call in the loop, and remember it for the subsequent calls, that
>> would eliminate the need for any micro-optimization in makesign() and make
>> inserting into a trigram index much faster (including building the index
>> from scratch).

> Isn't it possible to cache the signature of newitem in gtrgm_penalty
> like gtrgm_consistent does for the query?

[ studies that code for awhile ... ]  Ick, what a kluge.

The main problem with that code is that the cache data gets leaked at
the conclusion of a scan.  Having just seen the consequences of leaking
the "giststate", I think this is something we need to fix not emulate.

I wonder whether it's worth having the GIST code create a special
scan-lifespan (or insert-lifespan) memory context that could be used
for cached data such as this?  It's already creating a couple of
contexts for its own purposes, so one more might not be a big problem.
We'd have to figure out a way to make that context available to GIST
support functions, though, as well as something cleaner than fn_extra
for them to keep pointers in.

regards, tom lane



Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-29 Thread Alexander Korotkov
On Fri, Sep 30, 2011 at 1:16 AM, Tom Lane  wrote:

> Hmm.  Are there any other datatypes for which the penalty function has
> to duplicate effort?  I'm disinclined to fool with this if pg_trgm is
> the only example ... but if it's not, maybe we should do something
> about that instead of micro-optimizing makesign.
>
GiST for tsearch works similarly.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-29 Thread Alexander Korotkov
On Fri, Sep 30, 2011 at 1:08 AM, Heikki Linnakangas <
heikki.linnakan...@enterprisedb.com> wrote:

> At every call, gtrgm_penalty() has to calculate the signature for newitem,
> using makesign(). That's an enormous waste of effort, but there's currently
> no way for gtrgm_penalty() to avoid that. If we could call makesign() only on
> the first call in the loop, and remember it for the subsequent calls, that
> would eliminate the need for any micro-optimization in makesign() and make
> inserting into a trigram index much faster (including building the index
> from scratch).
>
Isn't it possible to cache the signature of newitem in gtrgm_penalty
like gtrgm_consistent does for the query?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-29 Thread Tom Lane
Heikki Linnakangas  writes:
> Looking at the big picture, however, the real problem with all those 
> makesign() calls is that they happen in the first place. They happen 
> when gist needs to choose which child page to place a new tuple on. It 
> calls the penalty for every item on the internal page, always passing 
> the new key as the 2nd argument, along the lines of:

> for (all items on internal page)
>penalty(item[i], newitem);

> At every call, gtrgm_penalty() has to calculate the signature for 
> newitem, using makesign(). That's an enormous waste of effort, but 
> there's currently no way for gtrgm_penalty() to avoid that.

Hmm.  Are there any other datatypes for which the penalty function has
to duplicate effort?  I'm disinclined to fool with this if pg_trgm is
the only example ... but if it's not, maybe we should do something
about that instead of micro-optimizing makesign.

regards, tom lane



Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-29 Thread Heikki Linnakangas

On 29.09.2011 20:27, Kevin Grittner wrote:

Heikki's second version, a more radical revision optimized for 64
bit systems, blows up on a 32 bit compile, writing off the end of
the structure.  Personally, I'd be OK with sacrificing some
performance for 32 bit systems to get better performance on 64 bit
systems, since people who care about performance generally seem to
be on 64 bit builds these days -- but it has to run.  Given Tom's
reservations about this approach, I don't know whether Heikki is
interested in fixing the crash so it can be benchmarked.  Heikki?


No, I'm not going to work on that 64-bit patch.

Looking at the big picture, however, the real problem with all those 
makesign() calls is that they happen in the first place. They happen 
when gist needs to choose which child page to place a new tuple on. It 
calls the penalty for every item on the internal page, always passing 
the new key as the 2nd argument, along the lines of:


for (all items on internal page)
  penalty(item[i], newitem);

At every call, gtrgm_penalty() has to calculate the signature for 
newitem, using makesign(). That's an enormous waste of effort, but 
there's currently no way for gtrgm_penalty() to avoid that. If we could call 
makesign() only on the first call in the loop, and remember it for the 
subsequent calls, that would eliminate the need for any 
micro-optimization in makesign() and make inserting into a trigram index 
much faster (including building the index from scratch).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-29 Thread Kevin Grittner
"Kevin Grittner"  wrote:
>>  Tom Lane  wrote:
 
>> Hmm, why is that patch the one posted for review, when several
>> better versions were already discussed? See thread starting here:
>> http://archives.postgresql.org/pgsql-hackers/2011-07/msg00028.php
>  
> The patch I reviewed was added to the CF app before the post you
> cite was written.  I didn't see it in following the links
> (probably because it crossed a month boundary).  Thanks for
> pointing that out; I'll update the CF app and review the later
> versions.
 
The first patch Tom posted was a clear improvement on Heikki's
original patch, and performed slightly better.
 
The second patch Tom posted makes the patch more robust in the face
of changes to the structure, but reduces the performance benefit on
the dictionary, and performs very close to the unpatched version for
samples of English text.  If anything, it seems to be slower with
real text, but the difference is so small it might be noise.  (The
dictionary tests are skewed toward longer average word length than
typically found in actual readable text.)  I wonder whether the code
for the leading, unaligned portion of the data could be written such
that it was effectively unrolled and the length resolved at compile
time rather than figured out at run time for every call to the
function.  That would solve the robustness issue without hurting
performance.  If we don't do something like that, this patch doesn't
seem worth applying.
 
Heikki's second version, a more radical revision optimized for 64
bit systems, blows up on a 32 bit compile, writing off the end of
the structure.  Personally, I'd be OK with sacrificing some
performance for 32 bit systems to get better performance on 64 bit
systems, since people who care about performance generally seem to
be on 64 bit builds these days -- but it has to run.  Given Tom's
reservations about this approach, I don't know whether Heikki is
interested in fixing the crash so it can be benchmarked.  Heikki?
 
I will do a set of more carefully controlled performance tests on
whatever versions are still in the running after we sort out the
issues above.
 
-Kevin



Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-25 Thread Kevin Grittner
> Tom Lane  wrote:
> "Kevin Grittner"  writes:
>> This is a review of the patch at this CF location:
>> https://commitfest.postgresql.org/action/patch_view?id=598
>> as posted here:
>> http://archives.postgresql.org/message-id/4e04c099.3020...@enterprisedb.com
>  
> Hmm, why is that patch the one posted for review, when several
> better versions were already discussed? See thread starting here:
> http://archives.postgresql.org/pgsql-hackers/2011-07/msg00028.php
 
The patch I reviewed was added to the CF app before the post you cite
was written.  I didn't see it in following the links (probably
because it crossed a month boundary).  Thanks for pointing that out;
I'll update the CF app and review the later versions.
 
-Kevin



Re: [HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-25 Thread Tom Lane
"Kevin Grittner"  writes:
> This is a review of the patch at this CF location:
> https://commitfest.postgresql.org/action/patch_view?id=598
> as posted here:
> http://archives.postgresql.org/message-id/4e04c099.3020...@enterprisedb.com

Hmm, why is that patch the one posted for review, when several better
versions were already discussed?  See thread starting here:
http://archives.postgresql.org/pgsql-hackers/2011-07/msg00028.php

regards, tom lane



[HACKERS] Re: Optimizing pg_trgm makesign() (was Re: WIP: Fast GiST index build)

2011-09-25 Thread Kevin Grittner
This is a review of the patch at this CF location:
 
https://commitfest.postgresql.org/action/patch_view?id=598
 
as posted here:
 
http://archives.postgresql.org/message-id/4e04c099.3020...@enterprisedb.com
 
This patch applied cleanly and compiled without warning.  It
performed correctly.  Since the patch modifies one function which has
one input and one output (through a pointer parameter), with no other
side effects, it has low risk of surprising problems.
 
It is strictly a performance patch, and accomplishes its goals.  I
ran the entire dictionary through the patched and unpatched versions
five times each (using the supplied debugging patch) and saw no
changes to the behavior of the function -- identical results every
time.  I ran performance tests with building and reindexing the
sorted dictionary, a randomly ordered dictionary, a randomly ordered
dictionary with the entries randomly truncated to 0 to 3 characters,
and a large README file -- five times each with and without the
patch.  The patch was a clear winner with all of those except the
truncated dictionary, where the difference was well within the noise
(0.05% difference summing five runs when individual runs could vary
by up to about 4%).
 
While there was no reason to believe it could affect search
performance, I tested that anyway, and found no difference.
 
Once this message shows up on the list (so I can retrieve the message
ID) I will mark this "Ready for Committer".
 
-Kevin

