date:20080530

[PATCHES] GIN improvements

2008-05-30 Thread Teodor Sigaev



Improvements to GIN indexes were presented at PGCon 2008. Presentation:
 http://www.sigaev.ru/gin/fastinsert_and_multicolumn_GIN.pdf

1) multicolumn GIN
This patch ( http://www.sigaev.ru/misc/multicolumn_gin-0.2.gz ) adds multicolumn
support to GIN. The basic idea is: keys (entries in GIN terminology) extracted
from values are stored in separate tuples along with their column number. In
that case, a multicolumn clause is just the AND of the per-column clauses. Unlike
other indexes, search performance doesn't depend on which columns of the index
(first, last, any subset) are used in the search clause. This property can be
used in gincostestimate, but I haven't looked at it yet.
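
A minimal sketch of the idea above, not code from the patch: each index entry is tagged with its column number, and a multicolumn search ANDs the per-column matches. All names here (ToyEntry, ToyClause, toy_lookup, toy_search) are hypothetical, and the set of matching rows is simplified to a 64-bit bitmap instead of a real posting list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified index entry: one (column, key) pair plus the
 * rows containing it, encoded as a bitmap (row i corresponds to bit i). */
typedef struct {
    int      column;
    int      key;
    uint64_t rows;
} ToyEntry;

/* One per-column condition of a search: "column contains key". */
typedef struct {
    int column;
    int key;
} ToyClause;

/* Return the rows stored for (column, key), or 0 if there is no entry. */
static uint64_t toy_lookup(const ToyEntry *entries, size_t nentries,
                           int column, int key)
{
    for (size_t i = 0; i < nentries; i++)
        if (entries[i].column == column && entries[i].key == key)
            return entries[i].rows;
    return 0;
}

/* A multicolumn clause is the AND of the per-column clauses. The lookup
 * does not care whether a clause refers to the first or the last column. */
static uint64_t toy_search(const ToyEntry *entries, size_t nentries,
                           const ToyClause *clauses, size_t nclauses)
{
    uint64_t result = ~UINT64_C(0);

    assert(nclauses > 0);
    for (size_t i = 0; i < nclauses; i++)
        result &= toy_lookup(entries, nentries,
                             clauses[i].column, clauses[i].key);
    return result;
}
```

Because every entry carries its own column number, a search on only the second column uses exactly the same lookup path as a search on the first one.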


2) fast insert into GIN
This patch ( http://www.sigaev.ru/misc/fast_insert_gin-0.4.gz ) implements the
idea of using the bulk insert technique that is used at index creation time.
Inserted rows are stored in a linked list of pending pages and are inserted into
the regular GIN structure at vacuum time. The algorithm is shown in the
presentation, but the insert completion process (vacuum) was significantly
reworked to improve concurrency. Now the list of pending pages is locked for a
much shorter time: only during deletion of pages from the list.
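
A toy model of the scheme described above, under heavy simplification: the regular GIN structure and the pending list are both plain arrays, and there is no locking or paging. The names (ToyIndex, toy_insert, toy_complete, toy_count) are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAPACITY 64

/* Hypothetical toy index: "main_keys" stands in for the regular GIN
 * structure, "pending" for the linked list of pending pages. */
typedef struct {
    int    main_keys[TOY_CAPACITY];
    size_t nmain;
    int    pending[TOY_CAPACITY];
    size_t npending;
} ToyIndex;

/* Fast insert: only append to the pending list. */
static void toy_insert(ToyIndex *idx, int key)
{
    assert(idx->npending < TOY_CAPACITY);
    idx->pending[idx->npending++] = key;
}

/* Insert completion (run at vacuum time): move all pending keys into the
 * main structure in one batch. Returns the number of keys moved. */
static size_t toy_complete(ToyIndex *idx)
{
    size_t moved = idx->npending;

    assert(idx->nmain + moved <= TOY_CAPACITY);
    for (size_t i = 0; i < moved; i++)
        idx->main_keys[idx->nmain++] = idx->pending[i];
    idx->npending = 0;
    return moved;
}

/* A search has to look at both places, so its cost grows with the number
 * of keys still waiting in the pending list. */
static size_t toy_count(const ToyIndex *idx, int key)
{
    size_t hits = 0;

    for (size_t i = 0; i < idx->nmain; i++)
        if (idx->main_keys[i] == key)
            hits++;
    for (size_t i = 0; i < idx->npending; i++)
        if (idx->pending[i] == key)
            hits++;
    return hits;
}
```

Search results are the same before and after completion; only the place where the keys live changes.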


Open item:
What is the right time to call insert completion? Currently, it is called by
ginbulkdelete and ginvacuumcleanup; ginvacuumcleanup calls completion if
ginbulkdelete wasn't called. That's not good, but it works. The completion
process should be started before ginbulkdelete, because ginbulkdelete doesn't
look at pending pages at all.

Insert completion (for any index that has such a method, I think) runs fast if
the number of inserted tuples is small, because it doesn't go through the whole
index, so, IMHO, the existing statistics fields should not be changed. The idea
discussed at PGCon was to have a trigger in vacuum which would be fired when the
number of inserted tuples becomes big. Now I don't think that idea is useful,
for two reasons: for a small number of tuples completion is cheap, and it should
be called before ginbulkdelete anyway. IMHO, it's enough to add an optional
method to pg_am (aminsertcleanup, per Tom's suggestion). This method would be
called before ambulkdelete and amvacuumcleanup. Opinions, objections, suggestions?
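
The proposed call order can be sketched as follows. This is only an illustration of "optional cleanup first, then the existing vacuum calls"; ToyAm, toy_vacuum_index and the trace-writing callbacks are hypothetical and are not the pg_am or vacuum code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an access method's vacuum-related callbacks. */
typedef struct {
    void (*insertcleanup)(char *trace);   /* optional: may be NULL */
    void (*bulkdelete)(char *trace);
    void (*vacuumcleanup)(char *trace);
} ToyAm;

/* Demo callbacks: each appends one letter to a trace string. */
static void toy_insertcleanup(char *trace) { strcat(trace, "I"); }
static void toy_bulkdelete(char *trace)    { strcat(trace, "B"); }
static void toy_vacuumcleanup(char *trace) { strcat(trace, "V"); }

/* Run the optional insert cleanup before anything else, so that bulk
 * delete never has to know about pending pages. "run_bulkdelete" models
 * the case where bulk delete is skipped for this vacuum. */
static void toy_vacuum_index(const ToyAm *am, int run_bulkdelete, char *trace)
{
    assert(am->bulkdelete != NULL && am->vacuumcleanup != NULL);

    if (am->insertcleanup != NULL)
        am->insertcleanup(trace);
    if (run_bulkdelete)
        am->bulkdelete(trace);
    am->vacuumcleanup(trace);
}
```

With this shape, an access method that has no pending inserts simply leaves the optional callback unset, and the cleanup step runs whether or not bulk delete is called.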


At the presentation some people were interested in how our changes affect
search speed after rows are inserted. The tests are below. We use the same
tables as in the presentation and measure search times (after insertion of some
rows) before vacuum (novac-d) and after vacuum (vac-d). All times are in ms.
Test tables contain 100000 rows. In the first table the number of elements in
each array is 100 with cardinality = 500, in the second 100 and 500000, in the
last 1000 and 500.


Insert 10000 into table with 100000 rows (10%)
                 |   v && '{v1}'    |
                 +---------+--------+ found
                 | novac-d |  vac-d |  rows
-----------------+---------+--------+-------
n:100,  c:500    |   118   |   35   | 19909
n:100,  c:500000 |    95   |   0.7  |    25
n:1000, c:500    |   380   |   79   | 95211


Insert 1000 into table with 100000 rows (1%)
                 |   v && '{v1}'    |
                 +---------+--------+ found
                 | novac-d |  vac-d |  rows
-----------------+---------+--------+-------
n:100,  c:500    |    40   |   31   | 18327
n:100,  c:500000 |    13   |   0.5  |    26
n:1000, c:500    |   102   |   71   | 87499

Insert 100 into table with 100000 rows (0.1%)
                 |   v && '{v1}'    |
                 +---------+--------+ found
                 | novac-d |  vac-d |  rows
-----------------+---------+--------+-------
n:100,  c:500    |    32   |   31   | 18171
n:100,  c:500000 |   1.7   |   0.5  |    20
n:1000, c:500    |    74   |   71   | 87499

Looking at the results it's easy to conclude that:
 - the time to search the pending list is O(number of inserted rows), i.e.,
   search time is equal to (time of search in GIN) + K1 * (number of rows
   inserted since the last vacuum).
 - search time is O(average length of indexed columns). The observation made
   above also applies here.
 - a significant performance gap starts at around 5-10% of inserts or near
   500-1000 inserts. This depends very much on the specific dataset.
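
The linear model from the first conclusion can be written down directly. This is a sketch only; toy_k1 and toy_search_ms are hypothetical helper names, and the model ignores everything except the pending-list scan.

```c
#include <assert.h>

/* Estimate K1 (extra ms of search time per pending row) from one pair of
 * measurements: search time before vacuum and search time after vacuum. */
static double toy_k1(double novac_ms, double vac_ms, double pending_rows)
{
    assert(pending_rows > 0);
    return (novac_ms - vac_ms) / pending_rows;
}

/* Predicted search time: (time of search in GIN) + K1 * (pending rows). */
static double toy_search_ms(double gin_ms, double k1, double pending_rows)
{
    return gin_ms + k1 * pending_rows;
}
```

Assuming the three runs inserted 10000, 1000 and 100 rows (the counts implied by the stated 10%, 1% and 0.1% of a 100000-row table), the n:100, c:500 measurements give K1 of (118 - 35) / 10000 = 0.0083, (40 - 31) / 1000 = 0.009 and (32 - 31) / 100 = 0.01 ms per pending row, which is roughly constant, as the model expects.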

Note that insert performance for GIN increased by up to 10 times. See the
exact results in the presentation.

Do we need to add an option to control this (fast insertion) feature?
If so, what should the default value be? It's not clear to me.

Note: These patches are mutually exclusive because they touch the same pieces
of code and I'm too lazy to manage several interdependent patches. I don't see
any problem with merging the patches into one, but IMHO it would be difficult
to review.

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsq

[PATCHES] partial header cleanup

2008-05-30 Thread Zdenek Kotala

This patch replaces xlog.h with xlogdefs.h in bufpage.h. All the other changes
add includes that had been forgotten elsewhere. It reduces the bloat in e.g.
itup.h by half. But there are still unresolved problems: htup should include
bufpage.h, because it needs the PageHeader size, but bufpage still has an
unnecessary bufmgr.h include, which generates bloat.


See the itup.h bloat:

http://doxygen.postgresql.org/itup_8h.html

This patch reduces the xlog side, but there are still about 18 unnecessary
includes.



Zdenek

PS: Thanks to Stefan K. He enabled graphs.
Index: src/backend/nodes/print.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/nodes/print.c,v
retrieving revision 1.87
diff -c -r1.87 print.c
*** src/backend/nodes/print.c	1 Jan 2008 19:45:50 -0000	1.87
--- src/backend/nodes/print.c	30 May 2008 15:13:42 -0000
***************
*** 20,25 ****
--- 20,26 ----
  #include "postgres.h"
  
  #include "access/printtup.h"
+ #include "lib/stringinfo.h"
  #include "nodes/print.h"
  #include "optimizer/clauses.h"
  #include "parser/parsetree.h"
Index: src/backend/postmaster/postmaster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/postmaster.c,v
retrieving revision 1.557
diff -c -r1.557 postmaster.c
*** src/backend/postmaster/postmaster.c	4 May 2008 21:13:35 -0000	1.557
--- src/backend/postmaster/postmaster.c	30 May 2008 15:13:42 -0000
***************
*** 93,98 ****
--- 93,99 ----
  #endif
  
  #include "access/transam.h"
+ #include "access/xlog.h"
  #include "bootstrap/bootstrap.h"
  #include "catalog/pg_control.h"
  #include "lib/dllist.h"
Index: src/backend/utils/adt/domains.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/adt/domains.c,v
retrieving revision 1.6
diff -c -r1.6 domains.c
*** src/backend/utils/adt/domains.c	1 Jan 2008 19:45:52 -0000	1.6
--- src/backend/utils/adt/domains.c	30 May 2008 15:13:42 -0000
***************
*** 33,38 ****
--- 33,39 ----
  
  #include "commands/typecmds.h"
  #include "executor/executor.h"
+ #include "lib/stringinfo.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  
Index: src/backend/utils/fmgr/fmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/fmgr/fmgr.c,v
retrieving revision 1.119
diff -c -r1.119 fmgr.c
*** src/backend/utils/fmgr/fmgr.c	15 May 2008 00:17:40 -0000	1.119
--- src/backend/utils/fmgr/fmgr.c	30 May 2008 15:13:42 -0000
***************
*** 19,24 ****
--- 19,25 ----
  #include "catalog/pg_language.h"
  #include "catalog/pg_proc.h"
  #include "executor/functions.h"
+ #include "lib/stringinfo.h"
  #include "miscadmin.h"
  #include "parser/parse_expr.h"
  #include "pgstat.h"
Index: src/include/access/gin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/gin.h,v
retrieving revision 1.20
diff -c -r1.20 gin.h
*** src/include/access/gin.h	16 May 2008 16:31:01 -0000	1.20
--- src/include/access/gin.h	30 May 2008 15:13:43 -0000
***************
*** 14,19 ****
--- 14,20 ----
  
  #include "access/itup.h"
  #include "access/relscan.h"
+ #include "access/xlog.h"
  #include "fmgr.h"
  #include "nodes/tidbitmap.h"
  #include "storage/block.h"
Index: src/include/access/heapam.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/heapam.h,v
retrieving revision 1.134
diff -c -r1.134 heapam.h
*** src/include/access/heapam.h	12 May 2008 00:00:53 -0000	1.134
--- src/include/access/heapam.h	30 May 2008 15:13:43 -0000
***************
*** 17,22 ****
--- 17,23 ----
  #include "access/htup.h"
  #include "access/relscan.h"
  #include "access/sdir.h"
+ #include "access/xlog.h"
  #include "nodes/primnodes.h"
  #include "storage/lock.h"
  #include "utils/snapshot.h"
Index: src/include/access/nbtree.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.118
diff -c -r1.118 nbtree.h
*** src/include/access/nbtree.h	16 Apr 2008 23:59:40 -0000	1.118
--- src/include/access/nbtree.h	30 May 2008 15:13:43 -0000
***************
*** 17,22 ****
--- 17,23 ----
  #include "access/itup.h"
  #include "access/relscan.h"
  #include "access/sdir.h"
+ #include "access/xlog.h"
  #include "access/xlogutils.h"
  
  
Index: src/include/storage/bufpage.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/storage/bufpage.h,v
retrieving revision 1.79
diff -c -r1.79 bufpage.h
*** src/include/storage/bufpage.h	12 May 2008 16:06:10 -0000	1.79
--- src/include/storage/bufpage.h	30 May 2008 15:13:43 -0000
***************
*** 14,20 ****
  #ifndef BUFPAGE_H
  #define BUFPAGE_H
  
! #include "access/xlog.h"
  #include "storage/bufmgr.h"
  #
