Re: [HACKERS] gincostestimate and hypothetical indexes

2015-12-01 Thread Tom Lane
Julien Rouhaud  writes:
> On 01/12/2015 00:37, Tom Lane wrote:
>> Maybe we could do something along the lines of pretending that 90% of the
>> index size given by the plugin is entry pages?  Don't know what a good
>> ratio would be exactly, but we could probably come up with one with a bit
>> of testing.

> I used zero values because gincostestimate already handles empty
> statistics, and pretends that 100% of the pages are entry pages:

Yeah, but that code is pretty bogus.  It was never intended to do more
than minimally hold the fort until someone had vacuumed.  If we're trying
to support hypothetical-index plugins with this code, it should try to
do something a bit more realistic.

I did a bit of investigation using some sample data I had laying around
(JSONB and TSVECTOR data).  It looks like assuming that entry pages are
90% of the index is not too awful; I saw actual values ranging from
80% to 94%.  The real weak spot of the current code, however, is

numEntries = numTuples; /* bogus, but no other info available */

which is just as bogus as it says, because numTuples is going to be the
heap tuple count not anything specific to GIN.  Often you'd expect the
number of entries to be some multiple of the number of tuples, because
the whole point of GIN is to be able to index components of the indexed
column's values.  But on the other hand if there are not a lot of
distinct component values, you could get many fewer entries than tuples.

Based on what I saw in this small sample, I'm inclined to propose setting
numEntries to 100 times numEntryPages, that is, assume 100 entries per
entry page.  That could easily be off by a factor of 2, but it seems more
robust than just blindly using the heap tuple count.

(Of course, all of this is predicated on the assumption that the
hypothetical-index plugin gives us some realistic value for the index's
size in pages, but if it doesn't it's the plugin's fault.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] gincostestimate and hypothetical indexes

2015-11-30 Thread Julien Rouhaud
Hello,

I figured out that it's not possible to use a hypothetical GIN index, as
the gincostestimate function tries to retrieve some statistical data from
the index meta page.

Attached patch fixes this. I believe this should be back-patched as was
a2095f7fb5a57ea1794f25d029756d9a140fd429.

Regards.
-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 37fad86..24ffa3a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -101,6 +101,7 @@
 #include 
 
 #include "access/gin.h"
+#include "access/gin_private.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/index.h"
@@ -7260,11 +7261,25 @@ gincostestimate(PG_FUNCTION_ARGS)
 	qinfos = deconstruct_indexquals(path);
 
 	/*
-	 * Obtain statistic information from the meta page
+	 * Obtain statistic information from the meta page if the index is not
+	 * hypothetical. Otherwise set all the counters to 0, as it would be for an
+	 * index that never got VACUUMed.
 	 */
-	indexRel = index_open(index->indexoid, AccessShareLock);
-	ginGetStats(indexRel, &ginStats);
-	index_close(indexRel, AccessShareLock);
+	if (!index->hypothetical)
+	{
+		indexRel = index_open(index->indexoid, AccessShareLock);
+		ginGetStats(indexRel, &ginStats);
+		index_close(indexRel, AccessShareLock);
+	}
+	else
+	{
+		ginStats.nPendingPages = 0;
+		ginStats.nTotalPages = 0;
+		ginStats.nEntryPages = 0;
+		ginStats.nDataPages = 0;
+		ginStats.nEntries = 0;
+		ginStats.ginVersion = GIN_CURRENT_VERSION;
+	}
 
 	numEntryPages = ginStats.nEntryPages;
 	numDataPages = ginStats.nDataPages;



Re: [HACKERS] gincostestimate and hypothetical indexes

2015-11-30 Thread Tom Lane
Julien Rouhaud  writes:
> I figured out that it's not possible to use a hypothetical GIN index, as
> the gincostestimate function tries to retrieve some statistical data from
> the index meta page.

Good point.

> Attached patch fixes this. I believe this should be back-patched as was
> a2095f7fb5a57ea1794f25d029756d9a140fd429.

I don't much care for this patch though.  The core problem is that just
returning all zeroes seems quite useless: it will probably result in silly
cost estimates.  The comment in the patch claiming that this would be the
situation in a never-vacuumed index is wrong, because ginbuild() updates
those stats too.  But I'm not sure exactly what to do instead :-(.

Ideally we'd put it on the head of the hypothetical-index plugin to invent
some numbers, but I dunno if we want to create such an API or not ... and
we certainly couldn't back-patch such a change.

Maybe we could do something along the lines of pretending that 90% of the
index size given by the plugin is entry pages?  Don't know what a good
ratio would be exactly, but we could probably come up with one with a bit
of testing.

regards, tom lane




Re: [HACKERS] gincostestimate and hypothetical indexes

2015-11-30 Thread Julien Rouhaud
On 01/12/2015 00:37, Tom Lane wrote:
> Julien Rouhaud  writes:
>> I figured out that it's not possible to use a hypothetical GIN index, as
>> the gincostestimate function tries to retrieve some statistical data from
>> the index meta page.
> 
> Good point.
> 
>> Attached patch fixes this. I believe this should be back-patched as was
>> a2095f7fb5a57ea1794f25d029756d9a140fd429.
> 
> I don't much care for this patch though.  The core problem is that just
> returning all zeroes seems quite useless: it will probably result in silly
> cost estimates.  The comment in the patch claiming that this would be the
> situation in a never-vacuumed index is wrong, because ginbuild() updates
> those stats too.  But I'm not sure exactly what to do instead :-(.
> 

Oops, it looks like this is only true for pre-9.1 indexes (per the comment
shown below).

> Ideally we'd put it on the head of the hypothetical-index plugin to invent
> some numbers, but I dunno if we want to create such an API or not ... and
> we certainly couldn't back-patch such a change.
> 
> Maybe we could do something along the lines of pretending that 90% of the
> index size given by the plugin is entry pages?  Don't know what a good
> ratio would be exactly, but we could probably come up with one with a bit
> of testing.
> 

I used zero values because gincostestimate already handles empty
statistics, and pretends that 100% of the pages are entry pages:


/*
* nPendingPages can be trusted, but the other fields are as of the last
* VACUUM. Scale them by the ratio numPages / nTotalPages to account for
* growth since then. If the fields are zero (implying no VACUUM at all,
* and an index created pre-9.1), assume all pages are entry pages.
*/
if (ginStats.nTotalPages == 0 || ginStats.nEntryPages == 0)
{
numEntryPages = numPages;
numDataPages = 0;
numEntries = numTuples; /* bogus, but no other info available */
}


But I don't have any clue of what would be a better ratio either.

>   regards, tom lane
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org




Re: [HACKERS] gincostestimate

2011-04-21 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
> The problem is that numEntries in src/backend/utils/adt/selfuncs.c is
> zero and eventually causes a division by zero and a cost estimate of
> NaN.
> ...
> I don't know what the solution is.  Simply setting numEntries to 1 if
> ginStats.nEntries is zero solves this particular problem, but I don't
> know what other consequences it might have.

Yeah, I think clamping numEntries to a minimum value of 1 is a
reasonable thing to do.  Will fix.

regards, tom lane



[HACKERS] gincostestimate

2011-04-08 Thread Jeff Janes
Dear Hackers,

A GIN index created on an initially empty table will never get used
until the table is vacuumed, which for a table with no update or
delete activity could be forever unless someone manually intervenes.


The problem is that numEntries in src/backend/utils/adt/selfuncs.c is
zero and eventually causes a division by zero and a cost estimate of
NaN.

The code below does not save the day, because nTotalPages and
nEntryPages are 2 and 1 respectively when an index is created on an
empty table, or when an indexed table is truncated.

if (ginStats.nTotalPages == 0 || ginStats.nEntryPages == 0)
{
numEntryPages = numPages;
numDataPages = 0;
numEntries = numTuples; /* bogus, but no other info available */
}

I don't know what the solution is.  Simply setting numEntries to 1 if
ginStats.nEntries is zero solves this particular problem, but I don't
know what other consequences it might have.

Cheers,

Jeff



Re: [HACKERS] gincostestimate

2010-10-17 Thread Tom Lane
Itagaki Takahiro itagaki.takah...@gmail.com writes:
> On Wed, Sep 8, 2010 at 1:02 AM, Teodor Sigaev teo...@sigaev.ru wrote:
>> Fixed, and slightly reworked to be more clear.
>> Attached patch is based on your patch.

> The stats page is updated only in VACUUM. ANALYZE doesn't update
> the information at all.

ANALYZE doesn't scan indexes, so it's not realistic to expect ANALYZE to
update these numbers.

> In addition, REINDEX, VACUUM FULL, and
> CLUSTER reset the information to zero, but the reset is not preferable.
> Is it possible to fill the statistic fields at bulk index-build?

I fixed that and committed it.  It probably wouldn't be a bad idea for
Teodor or Oleg to double-check where I put in the counter increments,
though I did test that they matched the results of VACUUM for a
reasonably large GIN index.

> We don't have any methods to dump the meta information at all.
> They might be internal information, but some developers and
> debuggers might want such kinds of tools. Contrib/pageinspect
> might be a good location to have such function; it has bt_metap().

That seems like a good idea, but I haven't done it.

regards, tom lane



Re: [HACKERS] gincostestimate

2010-10-06 Thread Itagaki Takahiro
On Wed, Sep 8, 2010 at 1:02 AM, Teodor Sigaev teo...@sigaev.ru wrote:
> Fixed, and slightly reworked to be more clear.
> Attached patch is based on your patch.

The patch will improve accuracy of plans using gin indexes.
It only adds block-level statistics information into the meta
pages in gin indexes. Data-level statistics are not collected
by the patch, and there are no changes in pg_statistic.

The stats page is updated only in VACUUM. ANALYZE doesn't update
the information at all. In addition, REINDEX, VACUUM FULL, and
CLUSTER reset the information to zero, but the reset is not preferable.
Is it possible to fill the statistic fields at bulk index-build?
No one wants to run VACUUM after VACUUM FULL to update the GIN stats.

We don't have any methods to dump the meta information at all.
They might be internal information, but some developers and
debuggers might want such kinds of tools. Contrib/pageinspect
might be a good location to have such function; it has bt_metap().

The patch applies cleanly, produces no compiler warnings, and passes
all existing regression tests. There is no additional documentation and
there are no regression tests -- I'm not sure whether we should have them.
If the patch is an internal improvement, docs are not needed.

-- 
Itagaki Takahiro



Re: [HACKERS] gincostestimate

2010-09-07 Thread Teodor Sigaev

> I also dropped the use of rd_amcache, instead having ginGetStats()

OK, I agree.

> I didn't do anything about the questionable equations in
> gincostestimate.  Those need to either be fixed, or documented as
> to why they're correct.  Other than that I think this could be
> committed.

Fixed, and slightly reworked to be more clear.
Attached patch is based on your patch.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


gincostestimate-0.24.gz
Description: Unix tar archive



Re: [HACKERS] gincostestimate

2010-08-07 Thread Tom Lane
I wrote:
 1. The use of rd_amcache is very questionable.

Attached is an alternate patch that I think you should give serious
consideration to.  The basic idea here is to only update the metapage
stats data during VACUUM, and not bother with incremental updates during
other operations.  That gets rid of a lot of the complexity and
opportunities for bugs-of-omission in the original approach, and also
reduces contention for the metapage as well as WAL traffic.
gincostestimate can compensate fairly well for index growth since the
last VACUUM by scaling up the recorded values by the known growth ratio
of the overall index size.  (Note that the index-pages count passed to
gincostestimate is accurate, having been recently gotten from
RelationGetNumberOfBlocks.)  Of course, this is only approximate, but
considering that the equations the values are going to be fed into are
even more approximate, I don't see a problem with that.

I also dropped the use of rd_amcache, instead having ginGetStats()
just read the metapage every time.  Since the planner stats per se
are now only updated during vacuum, it would be reasonable to use
rd_amcache to remember them, but there's still a problem with
nPendingPages.  I think that keeping it simple is the way to go,
at least until someone can show a performance problem with this way.

I didn't do anything about the questionable equations in
gincostestimate.  Those need to either be fixed, or documented as
to why they're correct.  Other than that I think this could be
committed.

regards, tom lane

PS: I still haven't tested this further than running the regression
tests, since I see little point in trying to check its estimation
behavior until those equations are fixed.  However, the hstore
regression test did expose a core dump in gincostestimate (failing
to guard against null partial_matches), which I have fixed here.



bink8QUUkvuQI.bin
Description: gincostestimate-0.21.gz



Re: [HACKERS] gincostestimate

2010-08-06 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
> [ review of gincostestimate-0.19 ]

I went through this patch, re-synced with current HEAD, and did some
minor editorializing; a new version is attached.  (Caution: I have not
tested this beyond verifying that it still compiles.)

> Codewise I have one question: the patch changes a loop in
> ginvacuumcleanup from
>    for (blkno = GIN_ROOT_BLKNO + 1; blkno < npages; blkno++)
> to
>    for (blkno = GIN_ROOT_BLKNO; blkno < npages; blkno++)
> why should it now go through all blocks?

I think this is correct.  Before, vacuum had nothing useful to do on the
root page so it just skipped it.  Now, it needs to count the root page
in the appropriate way in the stats it's gathering.  The previous coding
maybe could have used a comment, but this version is unsurprising.

> The patch has lots of statements like if ( GinPageIsLeaf(page) ), that
> is with extra space between the outer parenthesis and the condition,
> which AFAIK is not the project style. I guess pgindent fixes that, so
> it's no big deal.

> There are lines with elog(NOTICE) that are #if 0, they probably should
> either become elog(DEBUGX) or get removed.

I fixed the latter and cleaned up some of the formatting violations,
though not all.  I dunno if anyone feels like running pgindent on the
patch at the moment.

I think there are two big problems left before this patch can be
applied:

1. The use of rd_amcache is very questionable.  There's no provision for
updates executed by one session to get reflected into stats already
cached in another session.  You could fix that by forcing relcache
flushes whenever you update the stats, as btree does:

/* send out relcache inval for metapage change */
if (!InRecovery)
CacheInvalidateRelcache(rel);

However I think that's probably a Really Bad Idea, because it would
result in extremely frequent relcache flushes, and those are expensive.
(The reason this mechanism is good for btree is it only needs to update
its cache after a root page split, which is infrequent.)  My advice is
to drop the use of rd_amcache completely, and just have the code read
the metapage every time gincostestimate runs.  If you don't like that
then you're going to need to think hard about how often to update the
cache and what can drive that operation at the right times.

BTW, another problem that would need to be fixed if you keep this code
is that ginUpdateStatInPlace wants to force the new values into
rd_amcache, which (a) is pretty useless and (b) risks a PANIC on
palloc failure, because it's called from a critical section.

2. Some of the calculations in gincostestimate seem completely bogus.
In particular, this bit:

/* calc partial match: we don't need a search but an index scan */
*indexStartupCost += partialEntriesInQuals * numEntryPages / numEntries;

is surely wrong, because the number being added to indexStartupCost is
a pure ratio not scaled by any cost unit.  I don't understand what this
number is supposed to be, so it's not clear what cost variable ought to
be included.

This equation doesn't seem amazingly sane either:

/* cost to scan data pages for each matched entry */
pagesFetched = ceil((searchEntriesInQuals + partialEntriesInQuals) *
 numDataPages / numEntries);

This has pagesFetched *decreasing* as numEntries increases, which cannot
be right can it?

Also, right after that step, there's a bit of code that re-estimates
pagesFetched from selectivity and uses the larger value.  Fine, but why
are you only applying that idea here and not to the entry-pages
calculation done a few lines earlier?

regards, tom lane



binfclKHw9sIz.bin
Description: gincostestimate-0.20.gz



Re: [HACKERS] gincostestimate

2010-07-30 Thread Jan Urbański
OK, here's a review, as much as I was able to do it without 
understanding deeply how GIN works.


The patch is a context diff; it applies cleanly to HEAD, compiles without
warnings, and passes regression tests.


Using the script from 
http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php I 
was able to get an index scan with commonterm40, while with the 
unpatched source I was getting an index scan only for commonterm20, so 
it indeed improves the situation as far as cost estimation is concerned.


Codewise I have one question: the patch changes a loop in 
ginvacuumcleanup from


   for (blkno = GIN_ROOT_BLKNO + 1; blkno < npages; blkno++)

to

   for (blkno = GIN_ROOT_BLKNO; blkno < npages; blkno++)

why should it now go through all blocks? I couldn't immediately see why
it was not OK to do it before and why it is OK now, but I don't really
get how GIN works internally. I guess a comment would be good to have
there in any case.


The patch has lots of statements like if ( GinPageIsLeaf(page) ), that 
is with extra space between the outer parenthesis and the condition, 
which AFAIK is not the project style. I guess pgindent fixes that, so 
it's no big deal.


There are lines with elog(NOTICE) that are #if 0, they probably should 
either become elog(DEBUGX) or get removed.


As for performance, I tried running the attached script a couple of 
times. I used the standard config file, only changed checkpoint_segments 
to 30 and shared_buffers to 512MB. The timings were:


HEAD

INSERT 0 50
Time: 13487.450 ms
VACUUM
Time: 337.673 ms

INSERT 0 50
Time: 13751.110 ms
VACUUM
Time: 315.812 ms

INSERT 0 50
Time: 12691.259 ms
VACUUM
Time: 312.320 ms

HEAD + gincostestimate

INSERT 0 50
Time: 13961.894 ms
VACUUM
Time: 355.798 ms

INSERT 0 50
Time: 14114.975 ms
VACUUM
Time: 341.822 ms

INSERT 0 50
Time: 13679.871 ms
VACUUM
Time: 340.576 ms

so there is no immediate slowdown for a quick test with one client.

Since there were no stability or performance issues and it solves the
problem, I am marking this as ready for committer, although it might be
beneficial if someone more acquainted with GIN takes another look at it
before the committer review.


I will be travelling during the whole of August and will only have
intermittent email access, so for any questions about the review the
response time may be a few days.


Cheers,
Jan
drop table if exists x;
create table x(f tsvector);
create index x_idx on x using gin(f);
insert into x (select 'test words taken at random but a few of them indeed' 
from generate_series(1, 50) g);
vacuum;



Re: [HACKERS] gincostestimate

2010-07-30 Thread Robert Haas
On Fri, Jul 30, 2010 at 1:19 PM, Jan Urbański wulc...@wulczer.org wrote:
> The patch has lots of statements like if ( GinPageIsLeaf(page) ), that is
> with extra space between the outer parenthesis and the condition, which
> AFAIK is not the project style. I guess pgindent fixes that, so it's no big
> deal.

It's better if these get cleaned up.  pgindent will fix it eventually,
but the less stuff pgindent has to touch, the less likelihood there is
of breaking outstanding patches when it's run.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: [HACKERS] gincostestimate

2010-07-26 Thread Oleg Bartunov

Jan,

On Sun, 25 Jul 2010, Jan Urbański wrote:


> On 02/07/10 14:33, Teodor Sigaev wrote:
>> Patch implements much more accurate cost estimation for GIN index
>> scans than the generic cost estimation function.

I attached scripts.

> The links from that mail are now dead, so I set up my own test environment:
> * one table testfts(id serial, body text, body_fts tsvector)
> * 5 rows, each with 1000 random words taken from
> /usr/share/dict/british-english-insane (the wbritish-insane Debian
> package) separated by a single space
> * each row also had the word commonterm at the end, 80% had
> commonterm80, 60% had commonterm60 etc (using the same methodology as
> Jesper, that commonterm60 can appear only if commonterm80 is in the row)
> * a GIN index on the tsvectors
>
> I was able to reproduce his issue, that is: select id from ftstest where
> body_fts @@ to_tsquery('commonterm80'); was choosing a sequential scan,
> which was resulting in much longer execution than the bitmap index plan
> that I got after disabling seqscans.
>
> I then applied the patch, recompiled PG and tried again... and nothing
> changed. I first tried running ANALYSE and then dropping and recreating
> the GIN index, but the planner still chooses the seq scan.

Read the thread
http://archives.postgresql.org/pgsql-hackers/2010-04/msg01407.php
There is always a fuzz factor, as Tom said, about 1% in path cost comparisons.
You may compare plans for 'commonterm60' and 'commonterm40'.

> Full explains below (the NOTICE is a debugging aid from the patch, which
> I temporarily enabled to see if it's picking up the code).

From this debug output you can see that the cost estimates are now much
more accurate than before.

> I'll continue reading the code and trying to understand what it does,
> but in the meantime: am I doing something wrong that I don't see the
> planner switching to the bitmap index plan? I see that the difference in
> costs is small, so maybe I just need to tweak the planner knobs a bit?
> Is the output below expected?

I think Tom explained this:
http://archives.postgresql.org/pgsql-hackers/2010-04/msg01426.php




> Cheers,
> Jan


wulczer=# explain analyse select id from ftstest where body_fts @@
to_tsquery('commonterm80');
NOTICE:  GIN stats: nEntryPages: 49297.00 nDataPages: 16951.00
nPendingPages :0.00 nEntries: 277521.00
   QUERY PLAN

--
Seq Scan on ftstest  (cost=0.00..1567.00 rows=39890 width=4) (actual
time=221.893..33179.794 rows=39923 loops=1)
  Filter: (body_fts @@ to_tsquery('commonterm80'::text))
Total runtime: 33256.661 ms
(3 rows)

wulczer=# set enable_seqscan to false;
SET
Time: 0.257 ms
wulczer=# explain analyse select id from ftstest where body_fts @@
to_tsquery('commonterm80');
NOTICE:  GIN stats: nEntryPages: 49297.00 nDataPages: 16951.00
nPendingPages :0.00 nEntries: 277521.00
QUERY PLAN


Bitmap Heap Scan on ftstest  (cost=449.15..1864.50 rows=39890 width=4)
(actual time=107.421..181.284 rows=39923 loops=1)
  Recheck Cond: (body_fts @@ to_tsquery('commonterm80'::text))
  -  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..439.18
rows=39890 width=0) (actual time=97.057..97.057 rows=39923 loops=1)
Index Cond: (body_fts @@ to_tsquery('commonterm80'::text))
Total runtime: 237.218 ms
(5 rows)

Time: 237.999 ms




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: [HACKERS] gincostestimate

2010-07-26 Thread Jan Urbański

On 26/07/10 12:58, Oleg Bartunov wrote:

> Jan,
>
> On Sun, 25 Jul 2010, Jan Urbański wrote:
>
>> On 02/07/10 14:33, Teodor Sigaev wrote:
>>> Patch implements much more accurate cost estimation for GIN index
>>> scans than the generic cost estimation function.
>>
>> I was able to reproduce his issue, that is: select id from ftstest where
>> body_fts @@ to_tsquery('commonterm80'); was choosing a sequential scan,
>> which was resulting in much longer execution than the bitmap index plan
>> that I got after disabling seqscans.
>>
>> I then applied the patch, recompiled PG and tried again... and nothing
>> changed. I first tried running ANALYSE and then dropping and recreating
>> the GIN index, but the planner still chooses the seq scan.
>
> Read the thread
> http://archives.postgresql.org/pgsql-hackers/2010-04/msg01407.php
> There is always a fuzz factor, as Tom said, about 1% in path cost
> comparisons.
> You may compare plans for 'commonterm60', 'commonterm40'.

OK, I thought this might be the case, as with the patch the sequential
scan is winning only by a small margin.

Thanks,
Jan



Re: [HACKERS] gincostestimate

2010-07-25 Thread Jan Urbański
On 02/07/10 14:33, Teodor Sigaev wrote:
> Patch implements much more accurate cost estimation for GIN index
> scans than the generic cost estimation function.

Hi,

I'm reviewing this patch, and to begin with it I tried to reproduce the
problem that originally came up on -performance in
http://archives.postgresql.org/pgsql-performance/2009-10/msg00393.php

The links from that mail are now dead, so I set up my own test environment:
 * one table testfts(id serial, body text, body_fts tsvector)
 * 5 rows, each with 1000 random words taken from
/usr/share/dict/british-english-insane (the wbritish-insane Debian
package) separated by a single space
 * each row also had the word commonterm at the end, 80% had
commonterm80, 60% had commonterm60 etc (using the same methodology as
Jesper, that commonterm60 can appear only if commonterm80 is in the row)
 * a GIN index on the tsvectors

I was able to reproduce his issue, that is: select id from ftstest where
body_fts @@ to_tsquery('commonterm80'); was choosing a sequential scan,
which was resulting in much longer execution than the bitmap index plan
that I got after disabling seqscans.

I then applied the patch, recompiled PG and tried again... and nothing
changed. I first tried running ANALYSE and then dropping and recreating
the GIN index, but the planner still chooses the seq scan.

Full explains below (the NOTICE is a debugging aid from the patch, which
I temporarily enabled to see if it's picking up the code).

I'll continue reading the code and trying to understand what it does,
but in the meantime: am I doing something wrong that I don't see the
planner switching to the bitmap index plan? I see that the difference in
costs is small, so maybe I just need to tweak the planner knobs a bit?
Is the output below expected?

Cheers,
Jan


wulczer=# explain analyse select id from ftstest where body_fts @@
to_tsquery('commonterm80');
NOTICE:  GIN stats: nEntryPages: 49297.00 nDataPages: 16951.00
nPendingPages :0.00 nEntries: 277521.00
QUERY PLAN

--
 Seq Scan on ftstest  (cost=0.00..1567.00 rows=39890 width=4) (actual
time=221.893..33179.794 rows=39923 loops=1)
   Filter: (body_fts @@ to_tsquery('commonterm80'::text))
 Total runtime: 33256.661 ms
(3 rows)

wulczer=# set enable_seqscan to false;
SET
Time: 0.257 ms
wulczer=# explain analyse select id from ftstest where body_fts @@
to_tsquery('commonterm80');
NOTICE:  GIN stats: nEntryPages: 49297.00 nDataPages: 16951.00
nPendingPages :0.00 nEntries: 277521.00
 QUERY PLAN


 Bitmap Heap Scan on ftstest  (cost=449.15..1864.50 rows=39890 width=4)
(actual time=107.421..181.284 rows=39923 loops=1)
   Recheck Cond: (body_fts @@ to_tsquery('commonterm80'::text))
   -  Bitmap Index Scan on ftstest_gin_idx  (cost=0.00..439.18
rows=39890 width=0) (actual time=97.057..97.057 rows=39923 loops=1)
 Index Cond: (body_fts @@ to_tsquery('commonterm80'::text))
 Total runtime: 237.218 ms
(5 rows)

Time: 237.999 ms



[HACKERS] gincostestimate

2010-07-02 Thread Teodor Sigaev
Patch implements much more accurate cost estimation for a GIN index scan
than the generic cost estimation function. The basic idea is described in
the PGCon 2010 talk:
http://www.sai.msu.su/~megera/postgres/talks/pgcon-2010-1.pdf, pages 48-54.


After discussion at PGCon, the storage of the additional statistics has
been changed from the pg_class table to the meta page of the index. The
statistics are cached in Relation->rd_amcache to prevent frequent reads
of the meta page.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


gincostestimate-0.19.gz
Description: Unix tar archive
