Re: [HACKERS] Rewrite, normal execution vs. EXPLAIN ANALYZE
On 2010-08-31 12:07 AM +0300, I wrote: I looked at fixing this.. The previous patch had a bug in fmgr_sql() our regression tests didn't catch. Fixed version attached. Regards, Marko Tiikkaja *** a/src/backend/catalog/pg_proc.c --- b/src/backend/catalog/pg_proc.c *** *** 832,838 fmgr_sql_validator(PG_FUNCTION_ARGS) proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! querytree_list, NULL, NULL); } else --- 832,838 proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! llast(querytree_list), NULL, NULL); } else *** a/src/backend/executor/functions.c --- b/src/backend/executor/functions.c *** *** 90,107 typedef struct ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! /* head of linked list of execution_state records */ ! execution_state *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! static execution_state *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); --- 90,107 ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ + Snapshotsnapshot; JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! List *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! static List *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); *** *** 123,183 static void sqlfunction_destroy(DestReceiver *self); /* Set up the list of per-query execution_state records for a SQL function */ ! 
static execution_state * init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK) { ! execution_state *firstes = NULL; ! execution_state *preves = NULL; execution_state *lasttages = NULL; ! ListCell *qtl_item; ! foreach(qtl_item, queryTree_list) { ! Query *queryTree = (Query *) lfirst(qtl_item); ! Node *stmt; ! execution_state *newes; ! Assert(IsA(queryTree, Query)); ! if (queryTree->commandType == CMD_UTILITY) ! stmt = queryTree->utilityStmt; ! else ! stmt = (Node *) pg_plan_query(queryTree, 0, NULL); ! /* Precheck all commands for validity in a function */ ! if (IsA(stmt, TransactionStmt)) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! /* translator: %s is a SQL statement name */ !errmsg("%s is not allowed in a SQL function", ! CreateCommandTag(stmt; ! if (fcache->readonly_func && !CommandIsReadOnly(stmt)) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! /* translator: %s is a SQL statement name */ !errmsg("%s is not allowed in a non-volatile function", ! CreateCommandTag(stmt; ! newes = (execution_state *) palloc(sizeof(execution_st
Re: [HACKERS] Rewrite, normal execution vs. EXPLAIN ANALYZE
Hi, I looked at fixing this inconsistency by making all query list snapshot handling work like EXPLAIN ANALYZE's code does. The only reason I went this way was that implementing wCTEs on top of this behaviour is a lot easier. There were three places that needed fixing. The SPI and portal logic changes were quite straightforward, but the SQL language function code previously didn't know what query trees of the execution_state list belonged to which query so there was no way to tell when we actually needed to take a new snapshot. The approach I took was to change the representation of the SQL function cache to a list of execution_state lists, and grab a new snapshot between the lists. The patch needs a bit more comments and some cleaning up, but I thought I'd get your input first. Thoughts? Regards, Marko Tiikkaja *** a/src/backend/catalog/pg_proc.c --- b/src/backend/catalog/pg_proc.c *** *** 832,838 fmgr_sql_validator(PG_FUNCTION_ARGS) proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! querytree_list, NULL, NULL); } else --- 832,838 proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! llast(querytree_list), NULL, NULL); } else *** a/src/backend/executor/functions.c --- b/src/backend/executor/functions.c *** *** 90,107 typedef struct ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! /* head of linked list of execution_state records */ ! execution_state *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! 
static execution_state *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); --- 90,107 ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ + Snapshotsnapshot; JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! List *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! static List *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); *** *** 123,183 static void sqlfunction_destroy(DestReceiver *self); /* Set up the list of per-query execution_state records for a SQL function */ ! static execution_state * init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK) { ! execution_state *firstes = NULL; ! execution_state *preves = NULL; execution_state *lasttages = NULL; ! ListCell *qtl_item; ! foreach(qtl_item, queryTree_list) { ! Query *queryTree = (Query *) lfirst(qtl_item); ! Node *stmt; ! execution_state *newes; ! Assert(IsA(queryTree, Query)); ! if (queryTree->commandType == CMD_UTILITY) ! stmt = queryTree->utilityStmt; ! else ! stmt = (Node *) pg_plan_query(queryTree, 0, NULL); ! /* Precheck all commands for validity in a function */ ! if (IsA(stmt, TransactionStmt)) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! /* translator: %s is a SQL statement name */ !e
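The user-visible side of the change can be sketched with a multi-statement SQL function (the table and function names below are invented for illustration):

```sql
-- A two-statement SQL language function: whether the final SELECT's
-- snapshot is taken before or after the INSERT runs is exactly the
-- normal-execution vs. EXPLAIN ANALYZE discrepancy discussed above.
CREATE TABLE t (x int);

CREATE FUNCTION f() RETURNS bigint AS $$
    INSERT INTO t VALUES (1);
    SELECT count(*) FROM t;
$$ LANGUAGE sql;

SELECT f();                   -- normal execution
EXPLAIN ANALYZE SELECT f();   -- previously followed different snapshot rules
```

With the patch, both invocations should go through the same per-query-list snapshot logic, so they should behave identically.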
Re: [HACKERS] How to construct an exact plan
2010/8/30 Pei He :
> Hi,
> I am hacking postgresql 8.2.5. a) and b) do not work for me.
>
> The situation is that I made a join operator, and a scan operator.
> And, The join operator requires the scan operator as the inner. So, I
> need to have the full control of the join plan.
>
> I am not ready to provide the optimization support for the two new
> operators. And, I want to run some performance tests before working on
> the optimization part.
>
> So, I want to know if it is possible to directly create a path or a
> plan, and do a unit test for the operators.
>

Yes, it is possible - but it isn't simple. I think the better approach is to implement all the parts simply, and then at run time disable the built-in methods that don't interest you via SET enable_* to off.

Regards
Pavel Stehule

>
> Thanks
> --
> Pei
>
> On Mon, Aug 30, 2010 at 1:59 PM, Josh Berkus wrote:
>>
>>> I have developed a new operators, and I want to do some tests on it.
>>> I do not want the optimizer to choose the plan for me, and I need to
>>> construct a plan as exact as I want.
>>>
>>> Can anyone provide me a way to achieve that?
>>
>> a) easy: choose a simple enough query that its plan is always predictable.
>>
>> b) moderate: choose a query whose plan is predictable if you manipulate
>> the enable_* configuration settings
>>
>> c) hard: hack the PostgreSQL planner to choose a specific execution
>> plan, and recompile Postgres.
>>
>> --
>> -- Josh Berkus
>> PostgreSQL Experts Inc.
>> http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
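For option (b), the usual recipe is to turn off the planner methods you don't want for the session and check the result with EXPLAIN. A minimal sketch (the table names are hypothetical):

```sql
-- Steer the planner toward one join strategy by disabling alternatives.
-- These settings only discourage the named methods; they are
-- session-local and easily undone with RESET.
SET enable_hashjoin = off;
SET enable_mergejoin = off;   -- leaves nested loop as the preferred join

EXPLAIN SELECT * FROM a JOIN b ON a.id = b.id;   -- hypothetical tables

RESET enable_hashjoin;
RESET enable_mergejoin;
```

Note that this only constrains the strategy choice; it cannot force a specific join order or select a custom operator, which is why option (c) - hacking the planner - comes up for new node types.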
Re: [HACKERS] How to construct an exact plan
Hi,

I am hacking postgresql 8.2.5. a) and b) do not work for me.

The situation is that I made a join operator and a scan operator, and the join operator requires the scan operator as its inner input. So I need to have full control of the join plan.

I am not ready to provide the optimizer support for the two new operators, and I want to run some performance tests before working on the optimization part.

So, I want to know if it is possible to directly create a path or a plan, and do a unit test for the operators.

Thanks
--
Pei

On Mon, Aug 30, 2010 at 1:59 PM, Josh Berkus wrote:
>
>> I have developed a new operators, and I want to do some tests on it.
>> I do not want the optimizer to choose the plan for me, and I need to
>> construct a plan as exact as I want.
>>
>> Can anyone provide me a way to achieve that?
>
> a) easy: choose a simple enough query that its plan is always predictable.
>
> b) moderate: choose a query whose plan is predictable if you manipulate
> the enable_* configuration settings
>
> c) hard: hack the PostgreSQL planner to choose a specific execution
> plan, and recompile Postgres.
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] How to construct an exact plan
I forgot to mention that I am using postgresql 8.2.5. Thanks -- Pei On Mon, Aug 30, 2010 at 1:34 PM, Pei He wrote: > Hi, > I have developed a new operators, and I want to do some tests on it. > I do not want the optimizer to choose the plan for me, and I need to > construct a plan as exact as I want. > > Can anyone provide me a way to achieve that? > > Thanks > -- > Pei > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] thousand unrelated data files in pg_default tablespace
2010/8/30 Tom Lane :
> Pavel Stehule writes:
>> I found a PostgreSQL 8.3 server (Linux) used for large OLAP where the
>> data directory is bloating. There are more than one hundred thousand
>> files - 8KB or 0KB long. The filenames are not transformable to names
>> via oid2name. Does somebody know about similar bug?
>
> 1. 8.3.what?

postgres=# select version();
                          version
------------------------------------------------------------
 PostgreSQL 8.3.6 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
(1 row)

> 2. Any signs of distress in the postmaster log?  I'm wondering about
> being unable to complete checkpoints, or repeated backend crashes that
> might cause leakage of temp tables.

No, there is nothing.

> 3. What's in the files --- do they appear to be tables, indexes, random
> temp files from sorts/hashes, or what?  pg_filedump might help you here.

I have to contact the admin tomorrow. For now - half of the files were zero length, the other half were almost empty. These files are in the directory related to the pg_default tablespace.

Regards
Pavel Stehule

> regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] How to construct an exact plan
Hi,

I have developed a new operator, and I want to run some tests on it. I do not want the optimizer to choose the plan for me; I need to construct a plan exactly as I want.

Can anyone suggest a way to achieve that?

Thanks
--
Pei

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] thousand unrelated data files in pg_default tablespace
Pavel Stehule writes: > I found a PostgreSQL 8.3 server (Linux) used for large OLAP where the > data directory is bloating. There are more than one hundred thousand > files - 8KB or 0KB long. The filenames are not transformable to names > via oid2name. Does somebody know about similar bug? 1. 8.3.what? 2. Any signs of distress in the postmaster log? I'm wondering about being unable to complete checkpoints, or repeated backend crashes that might cause leakage of temp tables. 3. What's in the files --- do they appear to be tables, indexes, random temp files from sorts/hashes, or what? pg_filedump might help you here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
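Alongside pg_filedump, a quick SQL-side triage is to check whether the numeric part of a suspicious file name matches any relation (the 12345 below is a placeholder for an actual file name from the data directory):

```sql
-- relfilenode values are per-database, so run this in each database
-- that lives in the affected tablespace.
SELECT relname, relkind
FROM pg_class
WHERE relfilenode = 12345;   -- placeholder: numeric file name from disk

-- Zero rows for many such files would be consistent with orphaned
-- files, e.g. leftovers from crashed backends.
```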
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
I wrote: > I guess that something isn't properly setting up rnode.backend in > recovery processing, but didn't find it yet. CreateFakeRelcacheEntry is the culprit ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
"Kevin Grittner" writes:
> LOG:  database system was interrupted; last known up at 2010-08-30 09:13:23 CDT
> LOG:  database system was not properly shut down; automatic recovery in progress
> LOG:  consistent recovery state reached at 0/5C5D04
> LOG:  redo starts at 0/5C5D04
> TRAP:  FailedAssertion("!(backend == (-1))", File: "catalog.c", Line: 120)
> LOG:  startup process (PID 5338) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure

I can reproduce this; looks like it is fallout from the RelFileNodeBackend patch. Stack trace from core dump is

#4  0x4f912c in ExceptionalCondition (
    conditionName=0x89054 "!(backend == (-1))",
    errorType=0x89044 "FailedAssertion", fileName=0x88fc8 "catalog.c",
    lineNumber=120) at assert.c:57
#5  0x2473b8 in relpathbackend (rnode={spcNode = 1664, dbNode = 0,
    relNode = 11617}, backend=2063810256, forknum=2063672808) at catalog.c:120
#6  0x40a628 in mdopen (reln=0x40096748, forknum=VISIBILITYMAP_FORKNUM,
    behavior=EXTENSION_RETURN_NULL) at md.c:508
#7  0x409ce0 in mdexists (reln=0x40096748, forkNum=VISIBILITYMAP_FORKNUM) at md.c:237
#8  0x40c7ac in smgrexists (reln=0x7b011c28, forknum=2063848444) at smgr.c:207
#9  0x1f2464 in vm_readbuf (rel=0x40061bc0, blkno=0, extend=0 '\000') at visibilitymap.c:410
#10 0x1f1b80 in visibilitymap_clear (rel=0x7b011c28, heapBlk=2063848444) at visibilitymap.c:147
#11 0x1e924c in heap_xlog_insert (lsn={xlogid = 0, xrecoff = 26015704}, record=0x30) at heapam.c:4389
#12 0x1eaec8 in heap_redo (lsn={xlogid = 0, xrecoff = 26015704}, record=0x40083f70) at heapam.c:4823
#13 0x21bbd4 in StartupXLOG () at xlog.c:6229
#14 0x2215f8 in StartupProcessMain () at xlog.c:9233
#15 0x24571c in AuxiliaryProcessMain (argc=2, argv=0x7b03ade8) at bootstrap.c:412
#16 0x3cb560 in StartChildProcess (type=StartupProcess) at postmaster.c:4407
#17 0x3c6f6c in PostmasterMain (argc=3, argv=0x7b03aba0) at postmaster.c:1089
#18 0x34d998 in main (argc=3, argv=0x7b03aba0) at main.c:188

I guess that something isn't properly setting up rnode.backend in recovery processing, but didn't find it yet.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] thousand unrelated data files in pg_default tablespace
Hello

I found a PostgreSQL 8.3 server (Linux) used for large OLAP where the data directory is bloating. There are more than one hundred thousand files - 8KB or 0KB long. The filenames are not transformable to names via oid2name. Does somebody know about a similar bug?

Regards
Pavel Stehule

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] string function - "format" function proposal
2010/8/30 Alvaro Herrera :
> Excerpts from Pavel Stehule's message of lun ago 30 07:51:55 -0400 2010:
>> 2010/8/30 Itagaki Takahiro :
>> > On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule
>> > wrote:
>> >> propsals:
>> >> * "format" function - uses same formatting as PL/pgSQL RAISE statement
>> >> * "sprintf" function
>> >>
>> >> Now I propose a compromise - "format" function with only three tags:
>> >> %s .. some string
>> >> %i .. SQL identifier
>> >> %l .. string literal
>> >
>> > These are just ideas:
>> >
>> > * Use $n, as like as PREPARE command.
>> >   It allows for us to swap arguments in any order.
>> >   SELECT format('$2 before $1', 'aaa', 'bbb')
>>
>> what is use case for this feature? I don't see it.
>
> Translations :-)  I haven't had a use for that but I've heard people
> implement gettext of sorts in database tables.  Maybe that kind of
> thing would be of use here.
>
>> > * Call to_char() functions for each placeholder.
>> >   For example,
>> >     format('=={YYYY-MM-DD}==', tm::timestamp)
>> >   is equivalent to
>> >     '==' || to_char(tm, 'YYYY-MM-DD') || '=='
>> >   '{}' prints the input with the default format.
>> >
>> > New languages' libraries might be of some help. LLs, C#, etc.
>>
>> I though about integration with to_char function too. There are not
>> technical barrier. And I can live with just {to_char_format} too. It
>> can be or cannot be mixed with basic tags together - there is
>> specified a NULL value behave. If we allow {format} syntax, then we
>> have to specify a escape syntax for { and }. Do you have a some idea?
>
> What about %{sth}? That way you don't need to escape {. The closing } would
> need escaping only inside the %{} specifier, so {%{YYYY{\}MM}} prints
> {2010{}08}

Then you need escaping too :)

> So the above example is:
>
> format('==%{YYYY-MM-DD}==', tm::timestamp);

I am not sure if this is correct - but why not. So these are the possible combinations:

%s .. no quoting, NULL is ''
%{} .. no quoting, NULL is NULL .. like output from to_char
%{}s .. no quoting with formatting, NULL is ''

Right now I have no idea about a nice syntax for positional parameters - maybe %{...}$1s. Or we can use two variants of tags - non-positional '%' and positional '$' - so $1{...}s, %{...}s, $1, %s, $1s, $1{...}, %{...} can all be valid tags.

Regards
Pavel Stehule

> Not sure about this to_char stuff though, seems too cute. You can do
> the case above like this:
>
> format('==%s==', to_char(tm::timestamp, 'YYYY-MM-DD'))

I like using a to_char-like format tag - there is no technical problem: the format can be taken from the string and the data type of the parameter can be known too. But this feature can be a later enhancement. The basic features are NULL handling and right quoting.

>
> --
> Álvaro Herrera
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi,

On 08/30/2010 04:52 PM, Tom Lane wrote:
> Let me just point out that awhile back we got a *measurable* performance
> boost by eliminating a single indirect fetch from the buffer addressing
> code path.

I'll take a look at that, thanks.

> So I don't have any faith in untested assertions

Neither do I. Thus I'm probably going to try my approach.

Regards

Markus Wanner

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
Joe Conway writes: > I'm seeing it on a checkout from CVS head. Basically do Oh, OK, I misread Kevin to say this would happen during make installcheck itself. So the lack of buildfarm reports is not relevant. Will get some caffeine in me and then take a look. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
On 08/30/2010 08:16 AM, Tom Lane wrote:
> "Kevin Grittner" writes:
>> Checking out HEAD from the git repository I see an assertion
>> failure.
>
> The buildfarm isn't reporting any such thing.  Could you get a CVS
> checkout and diff it against the git results?

I'm seeing it on a checkout from CVS head. Basically do

make
make install
initdb
pg_ctl start
make installcheck
pg_ctl -m immediate stop
pg_ctl start

At this point I see the assertion in the logs.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
Re: [HACKERS] string function - "format" function proposal
Excerpts from Pavel Stehule's message of lun ago 30 07:51:55 -0400 2010: > 2010/8/30 Itagaki Takahiro : > > On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule > > wrote: > >> propsals: > >> * "format" function - uses same formatting as PL/pgSQL RAISE statement > >> * "sprintf" function > >> > >> Now I propose a compromise - "format" function with only three tags: > >> %s .. some string > >> %i .. SQL identifier > >> %l .. string literal > > > > These are just ideas: > > > > * Use $n, as like as PREPARE command. > > It allows for us to swap arguments in any order. > > SELECT format('$2 before $1', 'aaa', 'bbb') > > what is use case for this feature? I don't see it. Translations :-) I haven't had a use for that but I've heard people implements gettext of sorts in database tables. Maybe that kind of thing would be of use here. > > * Call to_char() functions for each placeholder. > > For example, > > format('=={-MM-DD}==', tm::timestamp) > > is equivalent to > > '==' || to_char(tm, '-MM-DD') || '==' > > '{}' prints the input with the default format. > > > > New languages' libraries might be of some help. LLs, C#, etc. > > I though about integration with to_char function too. There are not > technical barrier. And I can live with just {to_char_format} too. It > can be or cannot be mixed with basic tags together - there is > specified a NULL value behave. If we allow {format} syntax, then we > have to specify a escape syntax for { and }. Do you have a some idea? What about %{sth}? That way you don't need to escape {. The closing } would need escaping only inside the %{} specifier, so {%{{\}MM}} prints {2010{}08} So the above example is: format('==%{-MM-DD}==', tm::timestamp); Not sure about this to_char stuff though, seems too cute. You can do the case above like this: format('==%s==', to_char(tm::timestamp, '-MM-DD')) -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. 
PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
Tom Lane wrote: > Could you get a CVS checkout and diff it against the git results? Will do. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
"Kevin Grittner" writes: > Checking out HEAD from the git repository I see an assertion > failure. The buildfarm isn't reporting any such thing. Could you get a CVS checkout and diff it against the git results? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Markus Wanner writes: > AFAICT we currently have three fixed size blocks to manage shared > buffers: the buffer blocks themselves, the buffer descriptors, the > strategy status (for the freelist) and the buffer lookup table. > It's not obvious to me how these data structures should perform better > than a dynamically allocated layout. Let me just point out that awhile back we got a *measurable* performance boost by eliminating a single indirect fetch from the buffer addressing code path. We used to have an array of pointers pointing to the actual buffers, and we removed that in favor of assuming the buffers were laid out in a contiguous array, so that the address of buffer N could be computed with a shift-and-add, eliminating the pointer fetch. I forget exactly what the numbers were, but it was significant enough to make us change it. So I don't have any faith in untested assertions that we can convert these data structures to use dynamic allocation with no penalty. It's very difficult to see how you'd do that without introducing a new layer of indirection, and our experience shows that that layer will cost you. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Assertion failure on HEAD (or at least git copy of it)
Checking out HEAD from the git repository I see an assertion failure. Steps to reproduce:

make distclean
./configure --prefix=/usr/local/pgsql-serializable \
  --enable-debug --enable-depend --enable-cassert
make check
sudo make install
cd contrib/
make
sudo make install
cd ../
rm -fr /var/pgsql/data/serializable/*
initdb /var/pgsql/data/serializable
/usr/local/pgsql-serializable/pg_ctl \
  -D /var/pgsql/data/serializable -l logfile start
make installcheck
# wait 30 to 60 seconds
/usr/local/pgsql-serializable/pg_ctl \
  -D /var/pgsql/data/serializable -m immediate stop
/usr/local/pgsql-serializable/pg_ctl \
  -D /var/pgsql/data/serializable -l logfile start

The postmaster is not running. The tail of logfile is:

LOG:  wait_for_stats delayed 0.013267 seconds
STATEMENT:  SELECT wait_for_stats();
LOG:  received immediate shutdown request
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  database system was interrupted; last known up at 2010-08-30 09:13:23 CDT
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at 0/5C5D04
LOG:  redo starts at 0/5C5D04
TRAP:  FailedAssertion("!(backend == (-1))", File: "catalog.c", Line: 120)
LOG:  startup process (PID 5338) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

Joe found this while beating up on the SSI patch, but I'm also seeing it on unpatched HEAD, as checked out from git. This is 32 bit kubuntu and ubuntu (different versions).

-Kevin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cost_hashjoin
On Mon, 2010-08-30 at 13:34 +0100, Greg Stark wrote: > On Mon, Aug 30, 2010 at 10:18 AM, Simon Riggs wrote: > > cost_hashjoin() has some treatment of what occurs when numbatches > 1 > > but that additional cost is not proportional to numbatches. > > Because that's not how our hash batching works. We generate two temp > files for each batch, one for the outer and one for the inner. So if > we're batching then every tuple of both the inner and outer tables > (except for ones in the first batch) need to be written once and read > once regardless of the number of batches. Thanks for explaining. For some reason I thought we rewound the outer at the start of each batch, which is better for avoiding cache spoiling. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cost_hashjoin
On Mon, Aug 30, 2010 at 10:18 AM, Simon Riggs wrote: > cost_hashjoin() has some treatment of what occurs when numbatches > 1 > but that additional cost is not proportional to numbatches. Because that's not how our hash batching works. We generate two temp files for each batch, one for the outer and one for the inner. So if we're batching then every tuple of both the inner and outer tables (except for ones in the first batch) need to be written once and read once regardless of the number of batches. I do think the hash join implementation is a good demonstration of why C programming is faster at a micro-optimization level but slower at a macro level. Users of higher level languages would be much more likely to use any of the many fancier hashing data structures developed in the last few decades. in particular I think Cuckoo hashing would be interesting for us. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
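The write-once/read-once batching cost is easy to observe: shrink work_mem until the hash spills, and EXPLAIN ANALYZE reports the batch count on the Hash node (the tables here are hypothetical, output abbreviated):

```sql
SET work_mem = '64kB';   -- small enough to force the hash to batch

EXPLAIN ANALYZE
SELECT * FROM big_fact f JOIN big_dim d ON f.dim_id = d.id;
-- The Hash node then shows a line of roughly this shape:
--   Buckets: 1024  Batches: 8  Memory Usage: ...kB
-- Batches > 1 means inner and outer tuples outside the first batch
-- were each written to and read from temp files once.

RESET work_mem;
```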
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On Mon, Aug 30, 2010 at 6:18 PM, Simon Riggs wrote: > On Mon, 2010-08-30 at 09:59 +0300, Heikki Linnakangas wrote: > >> However, is it safe to use GetOldestXMin() during recovery? Or to put it >> other way, is GetOldestXMin() functioning correctly during hot standby? >> It only scans through the ProcArray, but not the known-assigned xids >> array. That seems like an oversight that needs to be fixed. > > Patch to implement that attached. The type of the value which KnownAssignedXidsGetOldestXmin returns should be TransactionId rather than int? Does caller of KnownAssignedXidsGetOldestXmin need to hold ProcArrayLock in (at least) shared mode? If yes, it's helpful to add that note as a comment. > Will then commit Fujii's patch. Thanks. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
register/unregister standby Re: [HACKERS] Synchronous replication
On Tue, Aug 10, 2010 at 5:58 AM, Heikki Linnakangas wrote:
> On 05/08/10 17:14, Fujii Masao wrote:
>>
>> I'm thinking to make users register and unregister each standbys via SQL
>> functions like register_standby() and unregister_standby():
>
> The register/unregister facility should be accessible from the streaming
> replication connection, so that you don't need to connect to any particular
> database in addition to the streaming connection.

Probably I've not understood your point correctly yet.

I think that the advantage of registering standbys is that we can specify which WAL files the master has to keep for the upcoming standby. IMO, it's usually called together with pg_start_backup as follows:

    SELECT register_standby('foo', pg_start_backup())

This requests that the master keep all the WAL files following the backup starting location which pg_start_backup returns. Now we can do that by using wal_keep_segments, but it's not easy to set because it's difficult to predict how many WAL files the standby will require.

So I've thought that the register/unregister facility should be used from a normal client connection. Why do you think it should be accessible from the SR connection?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] string function - "format" function proposal
2010/8/30 Itagaki Takahiro : > On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule > wrote: >> proposals: >> * "format" function - uses the same formatting as the PL/pgSQL RAISE statement >> * "sprintf" function >> >> Now I propose a compromise - a "format" function with only three tags: >> %s .. some string >> %i .. SQL identifier >> %l .. string literal > > These are just ideas: > > * Use $n, like the PREPARE command. > It allows us to swap arguments in any order. > SELECT format('$2 before $1', 'aaa', 'bbb') What is the use case for this feature? I don't see it. > * Call the to_char() functions for each placeholder. > For example, > format('=={YYYY-MM-DD}==', tm::timestamp) > is equivalent to > '==' || to_char(tm, 'YYYY-MM-DD') || '==' > '{}' prints the input with the default format. > > New languages' libraries might be of some help. LLs, C#, etc. I thought about integration with the to_char function too. There is no technical barrier, and I can live with just {to_char_format} too. Whether it can be mixed with the basic tags depends on specifying the NULL value behaviour. If we allow the {format} syntax, then we have to specify an escape syntax for { and }. Do you have an idea? Regards Pavel > > -- > Itagaki Takahiro
Re: [HACKERS] string function - "format" function proposal
On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule wrote: > proposals: > * "format" function - uses the same formatting as the PL/pgSQL RAISE statement > * "sprintf" function > > Now I propose a compromise - a "format" function with only three tags: > %s .. some string > %i .. SQL identifier > %l .. string literal These are just ideas: * Use $n, like the PREPARE command. It allows us to swap arguments in any order. SELECT format('$2 before $1', 'aaa', 'bbb') * Call the to_char() functions for each placeholder. For example, format('=={YYYY-MM-DD}==', tm::timestamp) is equivalent to '==' || to_char(tm, 'YYYY-MM-DD') || '==' '{}' prints the input with the default format. New languages' libraries might be of some help. LLs, C#, etc. -- Itagaki Takahiro
[HACKERS] string function - "format" function proposal
Hello I am returning to string functions. For me, the most important function still isn't committed. There was a discussion about a "format" or "sprintf" function, so here is a short summary. goal: to get a function that helps with formatting message texts and with building SQL commands (used as dynamic SQL) proposals: * "format" function - uses the same formatting as the PL/pgSQL RAISE statement * "sprintf" function Itagaki's objections to the "format" function: * there is no possibility to put two parameters without a space between them * it is too simple My objections to the sprintf function: * it is designed for a different environment than SQL - missing NULL support, missing support for date, timestamp, boolean, ... * it is too complex, and some parameters have different meanings for different tags * we have the "to_char" function for complex formatting now. Now I propose a compromise - a "format" function with only three tags: %s .. some string %i .. SQL identifier %l .. string literal Using NULL: for %s NULL is transformed to an empty string - like "concat" for %i NULL raises an exception for %l NULL is transformed to the string NULL. This system is still simple and sufficient. The implemented sprintf function can be moved to a "sprintf" contrib module. Comments? Regards Pavel Stehule
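To make the proposed semantics concrete, here is a small stand-alone C sketch of the three-tag behaviour described above. It is illustrative only: simple_format and its naive quoting are invented for this example and are not the proposed backend implementation, which would presumably go through quote_identifier() and quote_literal() instead.

```c
#include <stdlib.h>
#include <string.h>

static void append(char *dst, const char *src)
{
    strcat(dst, src);
}

/* Sketch of the proposed tags: %s (NULL -> empty string, like concat),
 * %i (NULL identifier is an error), %l (NULL -> the keyword NULL).
 * Returns a malloc'd string, or NULL on error. */
char *simple_format(const char *fmt, const char *args[], int nargs)
{
    char *out = calloc(1, 1024);    /* fixed size: a sketch, not production */
    int   argno = 0;

    for (const char *p = fmt; *p; p++)
    {
        if (*p != '%')
        {
            strncat(out, p, 1);     /* copy ordinary characters through */
            continue;
        }
        p++;
        if (argno >= nargs || *p == '\0')
        {
            free(out);
            return NULL;            /* too few arguments, or trailing % */
        }
        const char *arg = args[argno++];
        switch (*p)
        {
            case 's':               /* NULL behaves like concat(): empty */
                if (arg)
                    append(out, arg);
                break;
            case 'i':               /* NULL identifier raises an error */
                if (arg == NULL)
                {
                    free(out);
                    return NULL;
                }
                append(out, "\"");  /* naive double-quoting, no escaping */
                append(out, arg);
                append(out, "\"");
                break;
            case 'l':               /* NULL becomes the keyword NULL */
                if (arg == NULL)
                    append(out, "NULL");
                else
                {
                    append(out, "'");
                    append(out, arg);
                    append(out, "'");
                }
                break;
            default:
                free(out);
                return NULL;        /* unknown tag */
        }
    }
    return out;
}
```

Usage would look like simple_format("SELECT * FROM %i WHERE c = %l", args, 2), producing a query string safe against the NULL cases listed above.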
[HACKERS] cost_hashjoin
cost_hashjoin() has some treatment of what occurs when numbatches > 1 but that additional cost is not proportional to numbatches. The associated comment talks about "an extra time", making it sound like we think numbatches would only ever be 2 (or 1). Can someone explain the current code, or is there an oversight there? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
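For what it's worth, one reading of the current code is that the charge really is a flat "extra time": with numbatches > 1 every tuple is spilled to a temp file once and read back once, no matter how many batches there are, since each tuple lands in exactly one batch. A simplified sketch of that cost model follows; the function name, parameters, and the lumping together of startup and run cost are invented for illustration and do not reflect the real planner API.

```c
/* Approximate sketch of the numbatches > 1 charge in cost_hashjoin():
 * every tuple belongs to exactly one batch, so batching costs one extra
 * write and one extra read of each relation, independent of numbatches. */
double hash_batching_cost(int numbatches,
                          double innerpages,
                          double outerpages,
                          double seq_page_cost)
{
    if (numbatches <= 1)
        return 0.0;                 /* everything fits in work_mem: no spill */

    /* one extra write plus one extra read of both inputs ("an extra time") */
    return seq_page_cost * 2.0 * (innerpages + outerpages);
}
```

Under this reading, the cost is proportional to the data volume spilled, not to the batch count, so the comment's "an extra time" covers any numbatches > 1.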
Re: [HACKERS] Exposing the Xact commit order to the user
On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote: > In some systems (data warehousing, replication), the order of commits is > important, since that is the order in which changes have become visible. > This information could theoretically be extracted from the WAL, but > scanning the entire WAL just to extract this tidbit of information would > be excruciatingly painful. This idea had support from at least 6 hackers. I'm happy to add my own. Can I suggest it is added as a hook, rather than argue about the details too much? The main use case is in combination with external systems, so that way we can maintain the relevant code with the system that cares about it. > CommitTransaction() inside of xact.c will call a function that inserts > a new record into this array. The operation will for most of the time be > nothing more than taking a spinlock and adding the record to shared memory. > All the data for the record is readily available, does not require > further locking and can be collected locally before taking the spinlock. > The begin_timestamp is the transaction's idea of CURRENT_TIMESTAMP, the > commit_timestamp is what CommitTransaction() just decided to write into > the WAL commit record and the total_rowcount is the sum of inserted, > updated and deleted heap tuples during the transaction, which should be > easily available from the statistics collector, unless row stats are > disabled, in which case the datum would be zero. Does this need to be called while in a critical section? Or can we wait until after the actual marking of the commit before calling this? > Checkpoint handling will call a function to flush the shared buffers. > Together with this, the information from WAL records will be sufficient > to recover this data (except for row counts) during crash recovery. So it would need to work identically in recovery also? These two values are not currently stored in the commit WAL record.
> timestamptz xci_begin_timestamp
> int64 xci_total_rowcount

Both of those seem optional, so I don't really want them added to WAL. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
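As a rough illustration of the proposal being discussed - collect the record locally, then hold a short lock only for the copy into shared memory - here is a minimal C sketch. Every name and the ring layout are invented for the example, and the spinlock is mimicked with a GCC atomic builtin rather than PostgreSQL's s_lock machinery.

```c
#include <stdint.h>

/* Hypothetical per-commit record, modelled on the fields quoted above. */
typedef struct XactCommitInfo
{
    uint32_t xci_xid;
    int64_t  xci_begin_timestamp;   /* the xact's CURRENT_TIMESTAMP */
    int64_t  xci_commit_timestamp;  /* what went into the WAL commit record */
    int64_t  xci_total_rowcount;    /* ins+upd+del; 0 if row stats disabled */
} XactCommitInfo;

#define XCI_RING_SIZE 128

typedef struct XciRing
{
    volatile int   lck;             /* stand-in for a real spinlock */
    uint64_t       head;            /* number of records ever inserted */
    XactCommitInfo slots[XCI_RING_SIZE];
} XciRing;

/* The insert path CommitTransaction() would call: the record was
 * assembled locally, so only the copy runs under the lock. */
void xci_record_commit(XciRing *ring, const XactCommitInfo *info)
{
    while (__sync_lock_test_and_set(&ring->lck, 1))
        ;                           /* spin until the lock is free */

    ring->slots[ring->head % XCI_RING_SIZE] = *info;
    ring->head++;

    __sync_lock_release(&ring->lck);
}
```

The point of the shape is the one Jan makes: nothing inside the lock allocates, formats, or waits, which keeps the critical-section question above mostly about where the call sits, not about how long it holds the lock.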
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On Mon, 2010-08-30 at 09:59 +0300, Heikki Linnakangas wrote: > However, is it safe to use GetOldestXMin() during recovery? Or to put it > other way, is GetOldestXMin() functioning correctly during hot standby? > It only scans through the ProcArray, but not the known-assigned xids > array. That seems like an oversight that needs to be fixed. Patch to implement that attached. This is necessary since CreateCheckpoint is called during end of recovery, though at that point there are still xids in KnownAssignedXids since they aren't removed until slightly later. Not hugely important. Also allows GetOldestXmin to be safely called elsewhere, such as Fujii's earlier patch on this thread. Any objections to commit to both head and 9.0? Will then commit Fujii's patch. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index bba247b..7c93f1e 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -156,6 +156,7 @@ static int KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax); static int KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin, TransactionId xmax); +static int KnownAssignedXidsGetOldestXmin(void); static void KnownAssignedXidsDisplay(int trace_level); /* @@ -1112,6 +1113,18 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum) } } + if (RecoveryInProgress()) + { + /* + * Check to see whether KnownAssignedXids contains an xid value + * older than the main procarray. 
+ */ + TransactionId kaxmin = KnownAssignedXidsGetOldestXmin(); + if (TransactionIdIsNormal(kaxmin) && + TransactionIdPrecedes(kaxmin, result)) +result = kaxmin; + } + LWLockRelease(ProcArrayLock); /* @@ -3015,6 +3028,33 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin, return count; } +static int +KnownAssignedXidsGetOldestXmin(void) +{ + /* use volatile pointer to prevent code rearrangement */ + volatile ProcArrayStruct *pArray = procArray; + int head, +tail; + int i; + + /* + * Fetch head just once, since it may change while we loop. + */ + SpinLockAcquire(&pArray->known_assigned_xids_lck); + tail = pArray->tailKnownAssignedXids; + head = pArray->headKnownAssignedXids; + SpinLockRelease(&pArray->known_assigned_xids_lck); + + for (i = tail; i < head; i++) + { + /* Skip any gaps in the array */ + if (KnownAssignedXidsValid[i]) + return KnownAssignedXids[i]; + } + + return InvalidTransactionId; +} + /* * Display KnownAssignedXids to provide debug trail *
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
(Sorry, need to disable Ctrl-Return, which quite often sends mails earlier than I really want.. continuing my mail) On 08/27/2010 10:46 PM, Robert Haas wrote: Yeah, probably. I think designing something that works efficiently over a network is a somewhat different problem than designing something that works on an individual node, and we probably shouldn't let the designs influence each other too much. Agreed. Thus I've left out any kind of congestion avoidance stuff from imessages so far. There's no padding or sophisticated allocation needed. You just need a pointer to the last byte read (P1), the last byte allowed to be read (P2), and the last byte allocated (P3). Writers take a spinlock, advance P3, release the spinlock, write the message, take the spinlock, advance P2, release the spinlock, and signal the reader. That would block parallel writers (i.e. only one process can write to the queue at any time). I feel like there's probably some variant of this idea that works around that problem. The problem is that when a worker finishes writing a message, he needs to know whether to advance P2 only over his own message or also over some subsequent message that has been fully written in the meantime. I don't know exactly how to solve that problem off the top of my head, but it seems like it might be possible. I've tried pretty much that before. And failed. Because the allocation-order (i.e. the time the message gets created in preparation for writing to it) isn't necessarily the same as the sending-order (i.e. when the process has finished writing and decides to send the message). To satisfy the FIFO property WRT the sending order, you need to decouple allocation from the ordering (i.e. queuing logic). (And yes, it took me a while to figure out what was wrong in Postgres-R, before I even noticed that design bug). Readers take the spinlock, read P1 and P2, release the spinlock, read the data, take the spinlock, advance P1, and release the spinlock.
It would require copying data in case a process only needs to forward the message. That's a quick pointer dequeue and enqueue exercise ATM. If we need to do that, that's a compelling argument for having a single messaging area rather than one per backend. Absolutely, yes. But I'm not sure I see why we would need that sort of capability. Why wouldn't you just arrange for the sender to deliver the message directly to the final recipient? A process can read and even change the data of the message before forwarding it. Something the coordinator in Postgres-R does sometimes. (As it is the interface to the GCS and thus to the rest of the nodes in the cluster). For parallel querying (on a single node) that's probably less important a feature. So, they know in advance how large the message will be but not what the contents will be? What are they doing? Filling the message until it's (mostly) full and then continuing with the next one. At least that's how the streaming approach on top of imessages works. But yes, it's somewhat annoying to have to know the message size in advance. I didn't implement realloc so far. Nor can I think of any other solution. Note that separation of allocation and queue ordering is required anyway for the above reasons. Well, the fact that something is commonly used doesn't mean it's right for us. Tabula rasa, we might design the whole system differently, but changing it now is not to be undertaken lightly. Hopefully the above comments shed some light on my concerns. In short, (1) I don't want to preallocate a big chunk of memory we might not use, Isn't that exactly what we do now for lots of sub-systems, and what I'd like to improve (i.e. reduce to a single big chunk)? (2) I fear reducing the overall robustness of the system, and Well, that applies to pretty much every new feature you add. (3) I'm uncertain what other systems would be able to leverage a dynamic allocator of the sort you propose.
Okay, that's up to me to show evidence (or at least a PoC). Regards Markus Wanner
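A minimal single-writer sketch of the P1/P2/P3 ring discussed in this exchange may help make the disagreement concrete: with one writer, publishing (advancing P2) is trivial, while the multi-writer case is exactly the problem Markus describes, since a finished writer must not advance P2 past a slower predecessor's unfinished message. All names are invented for the example, and the spinlocks are elided.

```c
#include <stddef.h>

#define RING_SIZE 256

/* P1 = last byte consumed, P2 = last byte readable, P3 = last byte
 * allocated.  The counters grow monotonically; buf indexing is modulo. */
typedef struct MsgRing
{
    size_t p1;                      /* consumed by the reader */
    size_t p2;                      /* readable by the reader */
    size_t p3;                      /* allocated to writers   */
    char   buf[RING_SIZE];
} MsgRing;

/* writer: reserve (advance p3), copy the payload, publish (advance p2) */
int ring_write(MsgRing *r, const char *msg, size_t len)
{
    if (r->p3 + len - r->p1 > RING_SIZE)
        return 0;                   /* no room without clobbering unread data */

    size_t start = r->p3;
    r->p3 += len;                   /* reserve: spinlock-guarded in real life */

    for (size_t i = 0; i < len; i++)
        r->buf[(start + i) % RING_SIZE] = msg[i];

    r->p2 = r->p3;                  /* publish: only safe with ONE writer */
    return 1;
}

/* reader: consume everything between p1 and p2 */
size_t ring_read(MsgRing *r, char *out, size_t max)
{
    size_t n = 0;
    while (r->p1 < r->p2 && n < max)
    {
        out[n++] = r->buf[r->p1 % RING_SIZE];
        r->p1++;
    }
    return n;
}
```

With concurrent writers, the `r->p2 = r->p3` line is where allocation order and sending order collide: a writer that finishes early cannot know whether the region between P2 and its own message is fully written, which is the decoupling problem described above.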
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi, On 08/27/2010 10:46 PM, Robert Haas wrote: What other subsystems are you imagining servicing with a dynamic allocator? If there were a big demand for this functionality, we probably would have been forced to implement it already, but that's not the case. We've already discussed the fact that there are massive problems with using it for something like shared_buffers, which is by far the largest consumer of shared memory. Understood. I certainly plan to look into that for a better understanding of the problems those pose for dynamically allocated memory. I think it would be great if we could bring some more flexibility to our memory management. There are really two layers of problems here. Full ACK. One is resizing the segment itself, and one is resizing structures within the segment. As far as I can tell, there is no portable API that can be used to resize the shm itself. For so long as that remains the case, I am of the opinion that any meaningful resizing of the objects within the shm is basically unworkable. So we need to solve that problem first. Why should resizing of the objects within the shmem be unworkable? Don't my patches prove the exact opposite? Being able to resize "objects" within the shm requires some kind of underlying dynamic allocation. And I'd rather be in control of that allocator than have to deal with two dozen different implementations on different OSes and their libraries. There are a couple of possible solutions, which have been discussed here in the past. I currently don't have much interest in dynamic resizing. Being able to resize the overall amount of shared memory on the fly would be nice, sure. But the total amount of RAM in a server changes rather infrequently. Being able to use what's available more efficiently is what I'm interested in. That doesn't need any kind of additional or different OS level support. It's just a matter of making better use of what's available - within Postgres itself.
Next, we have to think about how we're going to resize data structures within this expandable shm. Okay, that's where I'm getting interested. Many of these structures are not things that we can easily move without bringing the system to a halt. For example, it's difficult to see how you could change the base address of shared buffers without ceasing all system activity, at which point there's not really much advantage over just forcing a restart. Similarly with LWLocks or the ProcArray. I guess that's what Bruce wanted to point out by saying our data structures are mostly "continuous". I.e. not dynamic lists or hash tables, but plain simple arrays. Maybe that's a subjective impression, but I seem to hear complaints about their fixed size and inflexibility quite often. Try to imagine the flexibility that dynamic lists could give us. And if you can't move them, then how will you grow them if (as will likely be the case) there's something immediately following them in memory? One possible solution is to divide up these data structures into "slabs". For example, we might imagine allocating shared_buffers in 1GB chunks. Why 1GB and do yet another layer of dynamic allocation within that? The buffers are (by default) 8K, so allocate in chunks of 8K. Or a tiny bit more for all of the book-keeping stuff. To make this work, we'd need to change the memory layout so that each chunk would include all of the miscellaneous stuff that we need to do bookkeeping for that chunk, such as the LWLocks and buffer descriptors. That doesn't seem completely impossible, but there would be some performance penalty, because you could no longer index into shared buffers from a single base offset. AFAICT we currently have four fixed-size blocks to manage shared buffers: the buffer blocks themselves, the buffer descriptors, the strategy status (for the freelist) and the buffer lookup table.
It's not obvious to me how these data structures should perform better than a dynamically allocated layout. One could rather argue that combining (some of) the bookkeeping stuff with the data itself would lead to better locality and thus perform better. Instead, you'd need to determine which chunk contains the buffer you want, look up the base address for that chunk, and then index into the chunk. Maybe that overhead wouldn't be significant (or maybe it would); at any rate, it's not completely free. There's also the problem of handling the partial chunk at the end, especially if that happens to be the only chunk. This sounds way too complicated, yes. Use 8K chunks and most of the problems vanish. I think the problems for other arrays are similar, or more severe. I can't see, for example, how you could resize the ProcArray using this approach. Try not to think in terms of resizing, but dynamic allocation. I.e. being able to resize ProcArray (and thus being able to alter max_connections on the fly) would take a lot more work.
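A tiny C sketch of the chunked addressing being debated makes the cost comparison concrete: finding a buffer in the chunked layout costs one extra division and modulo relative to the single-base-pointer layout. The constants here are demo-sized and the names invented; in the discussion the real values would be 8K pages, and either 1GB chunks (Robert) or per-buffer chunks (Markus).

```c
#include <stdlib.h>

/* demo-sized constants, not the real BLCKSZ / chunk geometry */
#define BLCKSZ          128
#define BUFS_PER_CHUNK  16

typedef struct BufferChunk
{
    char *blocks;                   /* BUFS_PER_CHUNK * BLCKSZ bytes */
    /* per-chunk bookkeeping (descriptors, LWLocks) would live here too */
} BufferChunk;

/* single-base-pointer layout: one multiply from a global base */
char *buf_addr_flat(char *base, size_t n)
{
    return base + n * BLCKSZ;
}

/* chunked layout: find the chunk first, then index within it */
char *buf_addr_chunked(BufferChunk *chunks, size_t n)
{
    BufferChunk *c = &chunks[n / BUFS_PER_CHUNK];

    return c->blocks + (n % BUFS_PER_CHUNK) * BLCKSZ;
}
```

The extra indirection is the "not completely free" overhead mentioned above; in exchange, each chunk can be allocated (or in principle freed) independently, which is what makes the layout growable without moving existing buffers.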
Re: [HACKERS] upcoming wraps
On sön, 2010-08-29 at 23:14 -0400, Tom Lane wrote: > > And what about 9.1alpha1? > > Peter muttered something about doing that this week. The major blocker is preparing the release notes. If someone has time for that ...
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On Mon, 2010-08-30 at 09:59 +0300, Heikki Linnakangas wrote: > On 27/08/10 20:17, Fujii Masao wrote: > > Yes. StartupXLOG calls that before bgwriter is invoked. That is, we can > > ensure that StartupSUBTRANS has always been done before bgwriter > > performs a restartpoint. > > Hmm, the comment in CreateCheckpoint() isn't totally accurate either: > > > * Truncate pg_subtrans if possible. We can throw away all data before > > * the oldest XMIN of any running transaction. No future transaction > > will > > * attempt to reference any pg_subtrans entry older than that (see Asserts > > * in subtrans.c). During recovery, though, we mustn't do this because > > * StartupSUBTRANS hasn't been called yet. Yep. > because in Hot Standby mode, StartupSUBTRANS has been called already. We > could truncate pg_subtrans there too when hot standby is enabled. But > this is only about the startup checkpoint at the end of recovery, so I'm > inclined to not change that, not right now just before release anyway, > just in case we're missing something... > > However, is it safe to use GetOldestXMin() during recovery? Or to put it > another way, is GetOldestXMin() functioning correctly during hot standby? > It only scans through the ProcArray, but not the known-assigned xids > array. That seems like an oversight that needs to be fixed. Yes, that's correct. Otherwise the patch is fine. I'm working on this now and will commit something shortly. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On 27/08/10 20:17, Fujii Masao wrote: Yes. StartupXLOG calls that before bgwriter is invoked. That is, we can ensure that StartupSUBTRANS has always been done before bgwriter performs a restartpoint. Hmm, the comment in CreateCheckpoint() isn't totally accurate either: * Truncate pg_subtrans if possible. We can throw away all data before * the oldest XMIN of any running transaction. No future transaction will * attempt to reference any pg_subtrans entry older than that (see Asserts * in subtrans.c). During recovery, though, we mustn't do this because * StartupSUBTRANS hasn't been called yet. because in Hot Standby mode, StartupSUBTRANS has been called already. We could truncate pg_subtrans there too when hot standby is enabled. But this is only about the startup checkpoint at the end of recovery, so I'm inclined to not change that, not right now just before release anyway, just in case we're missing something... However, is it safe to use GetOldestXMin() during recovery? Or to put it another way, is GetOldestXMin() functioning correctly during hot standby? It only scans through the ProcArray, but not the known-assigned xids array. That seems like an oversight that needs to be fixed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com