Re: [HACKERS] Rewrite, normal execution vs. EXPLAIN ANALYZE
On 2010-08-31 12:07 AM +0300, I wrote: I looked at fixing this.. The previous patch had a bug in fmgr_sql() our regression tests didn't catch. Fixed version attached. Regards, Marko Tiikkaja *** a/src/backend/catalog/pg_proc.c --- b/src/backend/catalog/pg_proc.c *** *** 832,838 fmgr_sql_validator(PG_FUNCTION_ARGS) proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! querytree_list, NULL, NULL); } else --- 832,838 proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! llast(querytree_list), NULL, NULL); } else *** a/src/backend/executor/functions.c --- b/src/backend/executor/functions.c *** *** 90,107 typedef struct ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! /* head of linked list of execution_state records */ ! execution_state *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! static execution_state *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); --- 90,107 ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ + Snapshotsnapshot; JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! List *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! static List *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); *** *** 123,183 static void sqlfunction_destroy(DestReceiver *self); /* Set up the list of per-query execution_state records for a SQL function */ ! 
static execution_state * init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK) { ! execution_state *firstes = NULL; ! execution_state *preves = NULL; execution_state *lasttages = NULL; ! ListCell *qtl_item; ! foreach(qtl_item, queryTree_list) { ! Query *queryTree = (Query *) lfirst(qtl_item); ! Node *stmt; ! execution_state *newes; ! Assert(IsA(queryTree, Query)); ! if (queryTree->commandType == CMD_UTILITY) ! stmt = queryTree->utilityStmt; ! else ! stmt = (Node *) pg_plan_query(queryTree, 0, NULL); ! /* Precheck all commands for validity in a function */ ! if (IsA(stmt, TransactionStmt)) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! /* translator: %s is a SQL statement name */ !errmsg("%s is not allowed in a SQL function", ! CreateCommandTag(stmt; ! if (fcache->readonly_func && !CommandIsReadOnly(stmt)) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! /* translator: %s is a SQL statement name */ !errmsg("%s is not allowed in a non-volatile function", ! CreateCommandTag(stmt; ! newes = (execution_state *) palloc(sizeof(execution_st
Re: [HACKERS] Rewrite, normal execution vs. EXPLAIN ANALYZE
Hi, I looked at fixing this inconsistency by making all query list snapshot handling work like EXPLAIN ANALYZE's code does. The only reason I went this way was that implementing wCTEs on top of this behaviour is a lot easier. There were three places that needed fixing. The SPI and portal logic changes were quite straightforward, but the SQL language function code previously didn't know what query trees of the execution_state list belonged to which query so there was no way to tell when we actually needed to take a new snapshot. The approach I took was to change the representation of the SQL function cache to a list of execution_state lists, and grab a new snapshot between the lists. The patch needs a bit more comments and some cleaning up, but I thought I'd get your input first. Thoughts? Regards, Marko Tiikkaja *** a/src/backend/catalog/pg_proc.c --- b/src/backend/catalog/pg_proc.c *** *** 832,838 fmgr_sql_validator(PG_FUNCTION_ARGS) proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! querytree_list, NULL, NULL); } else --- 832,838 proc->proargtypes.values, proc->pronargs); (void) check_sql_fn_retval(funcoid, proc->prorettype, ! llast(querytree_list), NULL, NULL); } else *** a/src/backend/executor/functions.c --- b/src/backend/executor/functions.c *** *** 90,107 typedef struct ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! /* head of linked list of execution_state records */ ! execution_state *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! 
static execution_state *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); --- 90,107 ParamListInfo paramLI; /* Param list representing current args */ Tuplestorestate *tstore;/* where we accumulate result tuples */ + Snapshotsnapshot; JunkFilter *junkFilter; /* will be NULL if function returns VOID */ ! List *func_state; } SQLFunctionCache; typedef SQLFunctionCache *SQLFunctionCachePtr; /* non-export function prototypes */ ! static List *init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK); static void init_sql_fcache(FmgrInfo *finfo, bool lazyEvalOK); *** *** 123,183 static void sqlfunction_destroy(DestReceiver *self); /* Set up the list of per-query execution_state records for a SQL function */ ! static execution_state * init_execution_state(List *queryTree_list, SQLFunctionCachePtr fcache, bool lazyEvalOK) { ! execution_state *firstes = NULL; ! execution_state *preves = NULL; execution_state *lasttages = NULL; ! ListCell *qtl_item; ! foreach(qtl_item, queryTree_list) { ! Query *queryTree = (Query *) lfirst(qtl_item); ! Node *stmt; ! execution_state *newes; ! Assert(IsA(queryTree, Query)); ! if (queryTree->commandType == CMD_UTILITY) ! stmt = queryTree->utilityStmt; ! else ! stmt = (Node *) pg_plan_query(queryTree, 0, NULL); ! /* Precheck all commands for validity in a function */ ! if (IsA(stmt, TransactionStmt)) ! ereport(ERROR, ! (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), ! /* translator: %s is a SQL statement name */ !e
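The user-visible side of the change can be sketched with a multi-statement SQL function (the table and function names below are invented for illustration):

```sql
-- A two-statement SQL language function: whether the final SELECT's
-- snapshot is taken before or after the INSERT runs is exactly the
-- normal-execution vs. EXPLAIN ANALYZE discrepancy discussed above.
CREATE TABLE t (x int);

CREATE FUNCTION f() RETURNS bigint AS $$
    INSERT INTO t VALUES (1);
    SELECT count(*) FROM t;
$$ LANGUAGE sql;

SELECT f();                   -- normal execution
EXPLAIN ANALYZE SELECT f();   -- previously followed different snapshot rules
```

With the patch, both invocations should go through the same per-query-list snapshot logic, so they should behave identically.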
Re: [HACKERS] How to construct an exact plan
2010/8/30 Pei He :
> Hi,
> I am hacking postgresql 8.2.5. a) and b) do not work for me.
>
> The situation is that I made a join operator, and a scan operator.
> And, The join operator requires the scan operator as the inner. So, I
> need to have the full control of the join plan.
>
> I am not ready to provide the optimization support for the two new
> operators. And, I want to run some performance tests before working on
> the optimization part.
>
> So, I want to know if it is possible to directly create a path or a
> plan, and do a unit test for the operators.
>

Yes, it is possible - but it isn't simple. I think the better approach is to implement all the parts simply, and then at run time disable the built-in methods that don't interest you via SET enable_* to off.

Regards
Pavel Stehule

>
> Thanks
> --
> Pei
>
> On Mon, Aug 30, 2010 at 1:59 PM, Josh Berkus wrote:
>>
>>> I have developed a new operators, and I want to do some tests on it.
>>> I do not want the optimizer to choose the plan for me, and I need to
>>> construct a plan as exact as I want.
>>>
>>> Can anyone provide me a way to achieve that?
>>
>> a) easy: choose a simple enough query that its plan is always predictable.
>>
>> b) moderate: choose a query whose plan is predictable if you manipulate
>> the enable_* configuration settings
>>
>> c) hard: hack the PostgreSQL planner to choose a specific execution
>> plan, and recompile Postgres.
>>
>> --
>> -- Josh Berkus
>> PostgreSQL Experts Inc.
>> http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
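For option (b), the usual recipe is to turn off the planner methods you don't want for the session and check the result with EXPLAIN. A minimal sketch (the table names are hypothetical):

```sql
-- Steer the planner toward one join strategy by disabling alternatives.
-- These settings only discourage the named methods; they are
-- session-local and easily undone with RESET.
SET enable_hashjoin = off;
SET enable_mergejoin = off;   -- leaves nested loop as the preferred join

EXPLAIN SELECT * FROM a JOIN b ON a.id = b.id;   -- hypothetical tables

RESET enable_hashjoin;
RESET enable_mergejoin;
```

Note that this only constrains the strategy choice; it cannot force a specific join order or select a custom operator, which is why option (c) - hacking the planner - comes up for new node types.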
Re: [HACKERS] How to construct an exact plan
Hi,

I am hacking postgresql 8.2.5. a) and b) do not work for me.

The situation is that I made a join operator and a scan operator, and the join operator requires the scan operator as its inner input. So I need to have full control of the join plan.

I am not ready to provide the optimizer support for the two new operators, and I want to run some performance tests before working on the optimization part.

So, I want to know if it is possible to directly create a path or a plan, and do a unit test for the operators.

Thanks
--
Pei

On Mon, Aug 30, 2010 at 1:59 PM, Josh Berkus wrote:
>
>> I have developed a new operators, and I want to do some tests on it.
>> I do not want the optimizer to choose the plan for me, and I need to
>> construct a plan as exact as I want.
>>
>> Can anyone provide me a way to achieve that?
>
> a) easy: choose a simple enough query that its plan is always predictable.
>
> b) moderate: choose a query whose plan is predictable if you manipulate
> the enable_* configuration settings
>
> c) hard: hack the PostgreSQL planner to choose a specific execution
> plan, and recompile Postgres.
>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] How to construct an exact plan
I forgot to mention that I am using postgresql 8.2.5. Thanks -- Pei On Mon, Aug 30, 2010 at 1:34 PM, Pei He wrote: > Hi, > I have developed a new operators, and I want to do some tests on it. > I do not want the optimizer to choose the plan for me, and I need to > construct a plan as exact as I want. > > Can anyone provide me a way to achieve that? > > Thanks > -- > Pei > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] thousand unrelated data files in pg_default tablespace
2010/8/30 Tom Lane :
> Pavel Stehule writes:
>> I found a PostgreSQL 8.3 server (Linux) used for large OLAP where the
>> data directory is bloating. There are more than one hundred thousand
>> files - 8KB or 0KB long. The filenames are not transformable to names
>> via oid2name. Does somebody know about similar bug?
>
> 1. 8.3.what?

postgres=# select version();
                          version
------------------------------------------------------------
 PostgreSQL 8.3.6 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
(1 row)

> 2. Any signs of distress in the postmaster log?  I'm wondering about
> being unable to complete checkpoints, or repeated backend crashes that
> might cause leakage of temp tables.

No, there is nothing.

> 3. What's in the files --- do they appear to be tables, indexes, random
> temp files from sorts/hashes, or what?  pg_filedump might help you here.

I have to contact the admin tomorrow. For now - half of the files were zero length, the other half were almost empty. These files are in the directory related to the pg_default tablespace.

Regards
Pavel Stehule

> regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] How to construct an exact plan
Hi,

I have developed a new operator, and I want to run some tests on it. I do not want the optimizer to choose the plan for me; I need to construct a plan exactly as I want.

Can anyone suggest a way to achieve that?

Thanks
--
Pei

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] thousand unrelated data files in pg_default tablespace
Pavel Stehule writes: > I found a PostgreSQL 8.3 server (Linux) used for large OLAP where the > data directory is bloating. There are more than one hundred thousand > files - 8KB or 0KB long. The filenames are not transformable to names > via oid2name. Does somebody know about similar bug? 1. 8.3.what? 2. Any signs of distress in the postmaster log? I'm wondering about being unable to complete checkpoints, or repeated backend crashes that might cause leakage of temp tables. 3. What's in the files --- do they appear to be tables, indexes, random temp files from sorts/hashes, or what? pg_filedump might help you here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
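Alongside pg_filedump, a quick SQL-side triage is to check whether the numeric part of a suspicious file name matches any relation (the 12345 below is a placeholder for an actual file name from the data directory):

```sql
-- relfilenode values are per-database, so run this in each database
-- that lives in the affected tablespace.
SELECT relname, relkind
FROM pg_class
WHERE relfilenode = 12345;   -- placeholder: numeric file name from disk

-- Zero rows for many such files would be consistent with orphaned
-- files, e.g. leftovers from crashed backends.
```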
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
I wrote: > I guess that something isn't properly setting up rnode.backend in > recovery processing, but didn't find it yet. CreateFakeRelcacheEntry is the culprit ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
"Kevin Grittner" writes:
> LOG:  database system was interrupted; last known up at 2010-08-30 09:13:23 CDT
> LOG:  database system was not properly shut down; automatic recovery in progress
> LOG:  consistent recovery state reached at 0/5C5D04
> LOG:  redo starts at 0/5C5D04
> TRAP:  FailedAssertion("!(backend == (-1))", File: "catalog.c", Line: 120)
> LOG:  startup process (PID 5338) was terminated by signal 6: Aborted
> LOG:  aborting startup due to startup process failure

I can reproduce this; looks like it is fallout from the RelFileNodeBackend patch. Stack trace from core dump is

#4  0x4f912c in ExceptionalCondition (
    conditionName=0x89054 "!(backend == (-1))",
    errorType=0x89044 "FailedAssertion", fileName=0x88fc8 "catalog.c",
    lineNumber=120) at assert.c:57
#5  0x2473b8 in relpathbackend (rnode={spcNode = 1664, dbNode = 0,
    relNode = 11617}, backend=2063810256, forknum=2063672808) at catalog.c:120
#6  0x40a628 in mdopen (reln=0x40096748, forknum=VISIBILITYMAP_FORKNUM,
    behavior=EXTENSION_RETURN_NULL) at md.c:508
#7  0x409ce0 in mdexists (reln=0x40096748, forkNum=VISIBILITYMAP_FORKNUM) at md.c:237
#8  0x40c7ac in smgrexists (reln=0x7b011c28, forknum=2063848444) at smgr.c:207
#9  0x1f2464 in vm_readbuf (rel=0x40061bc0, blkno=0, extend=0 '\000') at visibilitymap.c:410
#10 0x1f1b80 in visibilitymap_clear (rel=0x7b011c28, heapBlk=2063848444) at visibilitymap.c:147
#11 0x1e924c in heap_xlog_insert (lsn={xlogid = 0, xrecoff = 26015704}, record=0x30) at heapam.c:4389
#12 0x1eaec8 in heap_redo (lsn={xlogid = 0, xrecoff = 26015704}, record=0x40083f70) at heapam.c:4823
#13 0x21bbd4 in StartupXLOG () at xlog.c:6229
#14 0x2215f8 in StartupProcessMain () at xlog.c:9233
#15 0x24571c in AuxiliaryProcessMain (argc=2, argv=0x7b03ade8) at bootstrap.c:412
#16 0x3cb560 in StartChildProcess (type=StartupProcess) at postmaster.c:4407
#17 0x3c6f6c in PostmasterMain (argc=3, argv=0x7b03aba0) at postmaster.c:1089
#18 0x34d998 in main (argc=3, argv=0x7b03aba0) at main.c:188

I guess that something isn't properly setting up rnode.backend in recovery processing, but didn't find it yet.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] thousand unrelated data files in pg_default tablespace
Hello

I found a PostgreSQL 8.3 server (Linux) used for large OLAP where the data directory is bloating. There are more than one hundred thousand files - 8KB or 0KB long. The filenames are not transformable to names via oid2name. Does somebody know about a similar bug?

Regards
Pavel Stehule

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] string function - "format" function proposal
2010/8/30 Alvaro Herrera :
> Excerpts from Pavel Stehule's message of lun ago 30 07:51:55 -0400 2010:
>> 2010/8/30 Itagaki Takahiro :
>> > On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule
>> > wrote:
>> >> propsals:
>> >> * "format" function - uses same formatting as PL/pgSQL RAISE statement
>> >> * "sprintf" function
>> >>
>> >> Now I propose a compromise - "format" function with only three tags:
>> >> %s .. some string
>> >> %i .. SQL identifier
>> >> %l .. string literal
>> >
>> > These are just ideas:
>> >
>> > * Use $n, as like as PREPARE command.
>> >   It allows for us to swap arguments in any order.
>> >   SELECT format('$2 before $1', 'aaa', 'bbb')
>>
>> what is use case for this feature? I don't see it.
>
> Translations :-)  I haven't had a use for that but I've heard people
> implement gettext of sorts in database tables.  Maybe that kind of
> thing would be of use here.
>
>> > * Call to_char() functions for each placeholder.
>> >   For example,
>> >     format('=={YYYY-MM-DD}==', tm::timestamp)
>> >   is equivalent to
>> >     '==' || to_char(tm, 'YYYY-MM-DD') || '=='
>> >   '{}' prints the input with the default format.
>> >
>> > New languages' libraries might be of some help. LLs, C#, etc.
>>
>> I though about integration with to_char function too. There are not
>> technical barrier. And I can live with just {to_char_format} too. It
>> can be or cannot be mixed with basic tags together - there is
>> specified a NULL value behave. If we allow {format} syntax, then we
>> have to specify a escape syntax for { and }. Do you have a some idea?
>
> What about %{sth}? That way you don't need to escape {. The closing } would
> need escaping only inside the %{} specifier, so {%{YYYY{\}MM}} prints
> {2010{}08}

Then you need escaping too :)

> So the above example is:
>
> format('==%{YYYY-MM-DD}==', tm::timestamp);

I am not sure if this is correct - but why not. So these are the possible combinations:

%s .. no quoting, NULL is ''
%{} .. no quoting, NULL is NULL .. like output from to_char
%{}s .. no quoting with formatting, NULL is ''

Right now I have no idea about a nice syntax for positional parameters - maybe %{...}$1s. Or we can use two variants of tags - non-positional '%' and positional '$' - so $1{...}s, %{...}s, $1, %s, $1s, $1{...}, %{...} can all be valid tags.

Regards
Pavel Stehule

> Not sure about this to_char stuff though, seems too cute. You can do
> the case above like this:
>
> format('==%s==', to_char(tm::timestamp, 'YYYY-MM-DD'))

I like using a to_char-like format tag - there is no technical problem: the format can be taken from the string and the data type of the parameter can be known too. But this feature can be a later enhancement. The basic features are NULL handling and right quoting.

>
> --
> Álvaro Herrera
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi,

On 08/30/2010 04:52 PM, Tom Lane wrote:
> Let me just point out that awhile back we got a *measurable* performance
> boost by eliminating a single indirect fetch from the buffer addressing
> code path.

I'll take a look at that, thanks.

> So I don't have any faith in untested assertions

Neither do I. Thus I'm probably going to try my approach.

Regards

Markus Wanner

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
Joe Conway writes: > I'm seeing it on a checkout from CVS head. Basically do Oh, OK, I misread Kevin to say this would happen during make installcheck itself. So the lack of buildfarm reports is not relevant. Will get some caffeine in me and then take a look. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
On 08/30/2010 08:16 AM, Tom Lane wrote:
> "Kevin Grittner" writes:
>> Checking out HEAD from the git repository I see an assertion
>> failure.
>
> The buildfarm isn't reporting any such thing.  Could you get a CVS
> checkout and diff it against the git results?

I'm seeing it on a checkout from CVS head. Basically do

make
make install
initdb
pg_ctl start
make installcheck
pg_ctl -m immediate stop
pg_ctl start

At this point I see the assertion in the logs.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
Re: [HACKERS] string function - "format" function proposal
Excerpts from Pavel Stehule's message of lun ago 30 07:51:55 -0400 2010: > 2010/8/30 Itagaki Takahiro : > > On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule > > wrote: > >> propsals: > >> * "format" function - uses same formatting as PL/pgSQL RAISE statement > >> * "sprintf" function > >> > >> Now I propose a compromise - "format" function with only three tags: > >> %s .. some string > >> %i .. SQL identifier > >> %l .. string literal > > > > These are just ideas: > > > > * Use $n, as like as PREPARE command. > > It allows for us to swap arguments in any order. > > SELECT format('$2 before $1', 'aaa', 'bbb') > > what is use case for this feature? I don't see it. Translations :-) I haven't had a use for that but I've heard people implements gettext of sorts in database tables. Maybe that kind of thing would be of use here. > > * Call to_char() functions for each placeholder. > > For example, > > format('=={-MM-DD}==', tm::timestamp) > > is equivalent to > > '==' || to_char(tm, '-MM-DD') || '==' > > '{}' prints the input with the default format. > > > > New languages' libraries might be of some help. LLs, C#, etc. > > I though about integration with to_char function too. There are not > technical barrier. And I can live with just {to_char_format} too. It > can be or cannot be mixed with basic tags together - there is > specified a NULL value behave. If we allow {format} syntax, then we > have to specify a escape syntax for { and }. Do you have a some idea? What about %{sth}? That way you don't need to escape {. The closing } would need escaping only inside the %{} specifier, so {%{{\}MM}} prints {2010{}08} So the above example is: format('==%{-MM-DD}==', tm::timestamp); Not sure about this to_char stuff though, seems too cute. You can do the case above like this: format('==%s==', to_char(tm::timestamp, '-MM-DD')) -- Álvaro Herrera The PostgreSQL Company - Command Prompt, Inc. 
PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
Tom Lane wrote: > Could you get a CVS checkout and diff it against the git results? Will do. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Assertion failure on HEAD (or at least git copy of it)
"Kevin Grittner" writes: > Checking out HEAD from the git repository I see an assertion > failure. The buildfarm isn't reporting any such thing. Could you get a CVS checkout and diff it against the git results? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Markus Wanner writes: > AFAICT we currently have three fixed size blocks to manage shared > buffers: the buffer blocks themselves, the buffer descriptors, the > strategy status (for the freelist) and the buffer lookup table. > It's not obvious to me how these data structures should perform better > than a dynamically allocated layout. Let me just point out that awhile back we got a *measurable* performance boost by eliminating a single indirect fetch from the buffer addressing code path. We used to have an array of pointers pointing to the actual buffers, and we removed that in favor of assuming the buffers were laid out in a contiguous array, so that the address of buffer N could be computed with a shift-and-add, eliminating the pointer fetch. I forget exactly what the numbers were, but it was significant enough to make us change it. So I don't have any faith in untested assertions that we can convert these data structures to use dynamic allocation with no penalty. It's very difficult to see how you'd do that without introducing a new layer of indirection, and our experience shows that that layer will cost you. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Assertion failure on HEAD (or at least git copy of it)
Checking out HEAD from the git repository I see an assertion failure. Steps to reproduce:

make distclean
./configure --prefix=/usr/local/pgsql-serializable \
  --enable-debug --enable-depend --enable-cassert
make check
sudo make install
cd contrib/
make
sudo make install
cd ../
rm -fr /var/pgsql/data/serializable/*
initdb /var/pgsql/data/serializable
/usr/local/pgsql-serializable/pg_ctl \
  -D /var/pgsql/data/serializable -l logfile start
make installcheck
# wait 30 to 60 seconds
/usr/local/pgsql-serializable/pg_ctl \
  -D /var/pgsql/data/serializable -m immediate stop
/usr/local/pgsql-serializable/pg_ctl \
  -D /var/pgsql/data/serializable -l logfile start

The postmaster is not running. The tail of logfile is:

LOG:  wait_for_stats delayed 0.013267 seconds
STATEMENT:  SELECT wait_for_stats();
LOG:  received immediate shutdown request
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  database system was interrupted; last known up at 2010-08-30 09:13:23 CDT
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at 0/5C5D04
LOG:  redo starts at 0/5C5D04
TRAP:  FailedAssertion("!(backend == (-1))", File: "catalog.c", Line: 120)
LOG:  startup process (PID 5338) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

Joe found this while beating up on the SSI patch, but I'm also seeing it on unpatched HEAD, as checked out from git. This is 32 bit kubuntu and ubuntu (different versions).

-Kevin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cost_hashjoin
On Mon, 2010-08-30 at 13:34 +0100, Greg Stark wrote: > On Mon, Aug 30, 2010 at 10:18 AM, Simon Riggs wrote: > > cost_hashjoin() has some treatment of what occurs when numbatches > 1 > > but that additional cost is not proportional to numbatches. > > Because that's not how our hash batching works. We generate two temp > files for each batch, one for the outer and one for the inner. So if > we're batching then every tuple of both the inner and outer tables > (except for ones in the first batch) need to be written once and read > once regardless of the number of batches. Thanks for explaining. For some reason I thought we rewound the outer at the start of each batch, which is better for avoiding cache spoiling. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cost_hashjoin
On Mon, Aug 30, 2010 at 10:18 AM, Simon Riggs wrote: > cost_hashjoin() has some treatment of what occurs when numbatches > 1 > but that additional cost is not proportional to numbatches. Because that's not how our hash batching works. We generate two temp files for each batch, one for the outer and one for the inner. So if we're batching then every tuple of both the inner and outer tables (except for ones in the first batch) need to be written once and read once regardless of the number of batches. I do think the hash join implementation is a good demonstration of why C programming is faster at a micro-optimization level but slower at a macro level. Users of higher level languages would be much more likely to use any of the many fancier hashing data structures developed in the last few decades. in particular I think Cuckoo hashing would be interesting for us. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
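The write-once/read-once batching cost is easy to observe: shrink work_mem until the hash spills, and EXPLAIN ANALYZE reports the batch count on the Hash node (the tables here are hypothetical, output abbreviated):

```sql
SET work_mem = '64kB';   -- small enough to force the hash to batch

EXPLAIN ANALYZE
SELECT * FROM big_fact f JOIN big_dim d ON f.dim_id = d.id;
-- The Hash node then shows a line of roughly this shape:
--   Buckets: 1024  Batches: 8  Memory Usage: ...kB
-- Batches > 1 means inner and outer tuples outside the first batch
-- were each written to and read from temp files once.

RESET work_mem;
```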
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On Mon, Aug 30, 2010 at 6:18 PM, Simon Riggs wrote: > On Mon, 2010-08-30 at 09:59 +0300, Heikki Linnakangas wrote: > >> However, is it safe to use GetOldestXMin() during recovery? Or to put it >> other way, is GetOldestXMin() functioning correctly during hot standby? >> It only scans through the ProcArray, but not the known-assigned xids >> array. That seems like an oversight that needs to be fixed. > > Patch to implement that attached. The type of the value which KnownAssignedXidsGetOldestXmin returns should be TransactionId rather than int? Does caller of KnownAssignedXidsGetOldestXmin need to hold ProcArrayLock in (at least) shared mode? If yes, it's helpful to add that note as a comment. > Will then commit Fujii's patch. Thanks. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
register/unregister standby Re: [HACKERS] Synchronous replication
On Tue, Aug 10, 2010 at 5:58 AM, Heikki Linnakangas wrote:
> On 05/08/10 17:14, Fujii Masao wrote:
>>
>> I'm thinking to make users register and unregister each standbys via SQL
>> functions like register_standby() and unregister_standby():
>
> The register/unregister facility should be accessible from the streaming
> replication connection, so that you don't need to connect to any particular
> database in addition to the streaming connection.

Probably I've not understood your point correctly yet.

I think that the advantage of registering standbys is that we can specify which WAL files the master has to keep for the upcoming standby. IMO, it's usually called together with pg_start_backup as follows:

    SELECT register_standby('foo', pg_start_backup())

This requests that the master keep all the WAL files following the backup starting location which pg_start_backup returns. Now we can do that by using wal_keep_segments, but it's not easy to set because it's difficult to predict how many WAL files the standby will require.

So I've thought that the register/unregister facility should be used from a normal client connection. Why do you think it should be accessible from the SR connection?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] string function - "format" function proposal
2010/8/30 Itagaki Takahiro : > On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule > wrote: >> proposals: >> * "format" function - uses the same formatting as the PL/pgSQL RAISE statement >> * "sprintf" function >> >> Now I propose a compromise - a "format" function with only three tags: >> %s .. some string >> %i .. SQL identifier >> %l .. string literal > > These are just ideas: > > * Use $n, like the PREPARE command. > It allows us to swap arguments in any order. > SELECT format('$2 before $1', 'aaa', 'bbb') What is the use case for this feature? I don't see it. > * Call the to_char() functions for each placeholder. > For example, > format('=={YYYY-MM-DD}==', tm::timestamp) > is equivalent to > '==' || to_char(tm, 'YYYY-MM-DD') || '==' > '{}' prints the input with the default format. > > New languages' libraries might be of some help. LLs, C#, etc. I thought about integration with the to_char function too. There is no technical barrier, and I can live with just {to_char_format} too. Whether it can be mixed with the basic tags depends on specifying the NULL value behaviour. If we allow the {format} syntax, then we have to specify an escape syntax for { and }. Do you have an idea? Regards Pavel > > -- > Itagaki Takahiro
Re: [HACKERS] string function - "format" function proposal
On Mon, Aug 30, 2010 at 7:58 PM, Pavel Stehule wrote: > proposals: > * "format" function - uses the same formatting as the PL/pgSQL RAISE statement > * "sprintf" function > > Now I propose a compromise - a "format" function with only three tags: > %s .. some string > %i .. SQL identifier > %l .. string literal These are just ideas: * Use $n, like the PREPARE command. It allows us to swap arguments in any order. SELECT format('$2 before $1', 'aaa', 'bbb') * Call the to_char() functions for each placeholder. For example, format('=={YYYY-MM-DD}==', tm::timestamp) is equivalent to '==' || to_char(tm, 'YYYY-MM-DD') || '==' '{}' prints the input with the default format. New languages' libraries might be of some help. LLs, C#, etc. -- Itagaki Takahiro
[HACKERS] string function - "format" function proposal
Hello I am returning to string functions. For me, the most important function still isn't committed. There was a discussion about a "format" or "sprintf" function, so here is a short summary. goal: to get a function that helps with formatting message texts and with building SQL commands (used as dynamic SQL) proposals: * "format" function - uses the same formatting as the PL/pgSQL RAISE statement * "sprintf" function Itagaki's objections to the "format" function: * there is no possibility to put two parameters without a space between them * it is too simple My objections to the sprintf function: * it is designed for a different environment than SQL - missing NULL support, missing support for date, timestamp, boolean, ... * it is too complex, and some parameters have different meanings for different tags * we have the "to_char" function for complex formatting now. Now I propose a compromise - a "format" function with only three tags: %s .. some string %i .. SQL identifier %l .. string literal Using NULL: for %s NULL is transformed to an empty string - like "concat" for %i NULL raises an exception for %l NULL is transformed to the string NULL. This system is still simple and sufficient. The implemented sprintf function can be moved to a "sprintf" contrib module. Comments? Regards Pavel Stehule
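To make the proposed semantics concrete, here is a small stand-alone C sketch of the three-tag behaviour described above. It is illustrative only: simple_format and its naive quoting are invented for this example and are not the proposed backend implementation, which would presumably go through quote_identifier() and quote_literal() instead.

```c
#include <stdlib.h>
#include <string.h>

static void append(char *dst, const char *src)
{
    strcat(dst, src);
}

/* Sketch of the proposed tags: %s (NULL -> empty string, like concat),
 * %i (NULL identifier is an error), %l (NULL -> the keyword NULL).
 * Returns a malloc'd string, or NULL on error. */
char *simple_format(const char *fmt, const char *args[], int nargs)
{
    char *out = calloc(1, 1024);    /* fixed size: a sketch, not production */
    int   argno = 0;

    for (const char *p = fmt; *p; p++)
    {
        if (*p != '%')
        {
            strncat(out, p, 1);     /* copy ordinary characters through */
            continue;
        }
        p++;
        if (argno >= nargs || *p == '\0')
        {
            free(out);
            return NULL;            /* too few arguments, or trailing % */
        }
        const char *arg = args[argno++];
        switch (*p)
        {
            case 's':               /* NULL behaves like concat(): empty */
                if (arg)
                    append(out, arg);
                break;
            case 'i':               /* NULL identifier raises an error */
                if (arg == NULL)
                {
                    free(out);
                    return NULL;
                }
                append(out, "\"");  /* naive double-quoting, no escaping */
                append(out, arg);
                append(out, "\"");
                break;
            case 'l':               /* NULL becomes the keyword NULL */
                if (arg == NULL)
                    append(out, "NULL");
                else
                {
                    append(out, "'");
                    append(out, arg);
                    append(out, "'");
                }
                break;
            default:
                free(out);
                return NULL;        /* unknown tag */
        }
    }
    return out;
}
```

Usage would look like simple_format("SELECT * FROM %i WHERE c = %l", args, 2), producing a query string safe against the NULL cases listed above.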
[HACKERS] cost_hashjoin
cost_hashjoin() has some treatment of what occurs when numbatches > 1 but that additional cost is not proportional to numbatches. The associated comment talks about "an extra time", making it sound like we think numbatches would only ever be 2 (or 1). Can someone explain the current code, or is there an oversight there? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
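For what it's worth, one reading of the current code is that the charge really is a flat "extra time": with numbatches > 1 every tuple is spilled to a temp file once and read back once, no matter how many batches there are, since each tuple lands in exactly one batch. A simplified sketch of that cost model follows; the function name, parameters, and the lumping together of startup and run cost are invented for illustration and do not reflect the real planner API.

```c
/* Approximate sketch of the numbatches > 1 charge in cost_hashjoin():
 * every tuple belongs to exactly one batch, so batching costs one extra
 * write and one extra read of each relation, independent of numbatches. */
double hash_batching_cost(int numbatches,
                          double innerpages,
                          double outerpages,
                          double seq_page_cost)
{
    if (numbatches <= 1)
        return 0.0;                 /* everything fits in work_mem: no spill */

    /* one extra write plus one extra read of both inputs ("an extra time") */
    return seq_page_cost * 2.0 * (innerpages + outerpages);
}
```

Under this reading, the cost is proportional to the data volume spilled, not to the batch count, so the comment's "an extra time" covers any numbatches > 1.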
Re: [HACKERS] Exposing the Xact commit order to the user
On Sun, 2010-05-23 at 16:21 -0400, Jan Wieck wrote: > In some systems (data warehousing, replication), the order of commits is > important, since that is the order in which changes have become visible. > This information could theoretically be extracted from the WAL, but > scanning the entire WAL just to extract this tidbit of information would > be excruciatingly painful. This idea had support from at least 6 hackers. I'm happy to add my own. Can I suggest it is added as a hook, rather than argue about the details too much? The main use case is in combination with external systems, so that way we can maintain the relevant code with the system that cares about it. > CommitTransaction() inside of xact.c will call a function that inserts > a new record into this array. The operation will for most of the time be > nothing more than taking a spinlock and adding the record to shared memory. > All the data for the record is readily available, does not require > further locking and can be collected locally before taking the spinlock. > The begin_timestamp is the transaction's idea of CURRENT_TIMESTAMP, the > commit_timestamp is what CommitTransaction() just decided to write into > the WAL commit record and the total_rowcount is the sum of inserted, > updated and deleted heap tuples during the transaction, which should be > easily available from the statistics collector, unless row stats are > disabled, in which case the datum would be zero. Does this need to be called while in a critical section? Or can we wait until after the actual marking of the commit before calling this? > Checkpoint handling will call a function to flush the shared buffers. > Together with this, the information from WAL records will be sufficient > to recover this data (except for row counts) during crash recovery. So it would need to work identically in recovery also? These two values are not currently stored in the commit WAL record.
> timestamptz xci_begin_timestamp
> int64 xci_total_rowcount

Both of those seem optional, so I don't really want them added to WAL. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
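As a rough illustration of the proposal being discussed - collect the record locally, then hold a short lock only for the copy into shared memory - here is a minimal C sketch. Every name and the ring layout are invented for the example, and the spinlock is mimicked with a GCC atomic builtin rather than PostgreSQL's s_lock machinery.

```c
#include <stdint.h>

/* Hypothetical per-commit record, modelled on the fields quoted above. */
typedef struct XactCommitInfo
{
    uint32_t xci_xid;
    int64_t  xci_begin_timestamp;   /* the xact's CURRENT_TIMESTAMP */
    int64_t  xci_commit_timestamp;  /* what went into the WAL commit record */
    int64_t  xci_total_rowcount;    /* ins+upd+del; 0 if row stats disabled */
} XactCommitInfo;

#define XCI_RING_SIZE 128

typedef struct XciRing
{
    volatile int   lck;             /* stand-in for a real spinlock */
    uint64_t       head;            /* number of records ever inserted */
    XactCommitInfo slots[XCI_RING_SIZE];
} XciRing;

/* The insert path CommitTransaction() would call: the record was
 * assembled locally, so only the copy runs under the lock. */
void xci_record_commit(XciRing *ring, const XactCommitInfo *info)
{
    while (__sync_lock_test_and_set(&ring->lck, 1))
        ;                           /* spin until the lock is free */

    ring->slots[ring->head % XCI_RING_SIZE] = *info;
    ring->head++;

    __sync_lock_release(&ring->lck);
}
```

The point of the shape is the one Jan makes: nothing inside the lock allocates, formats, or waits, which keeps the critical-section question above mostly about where the call sits, not about how long it holds the lock.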
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On Mon, 2010-08-30 at 09:59 +0300, Heikki Linnakangas wrote: > However, is it safe to use GetOldestXMin() during recovery? Or to put it > other way, is GetOldestXMin() functioning correctly during hot standby? > It only scans through the ProcArray, but not the known-assigned xids > array. That seems like an oversight that needs to be fixed. Patch to implement that attached. This is necessary since CreateCheckpoint is called during end of recovery, though at that point there are still xids in KnownAssignedXids since they aren't removed until slightly later. Not hugely important. Also allows GetOldestXmin to be safely called elsewhere, such as Fujii's earlier patch on this thread. Any objections to commit to both head and 9.0? Will then commit Fujii's patch. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index bba247b..7c93f1e 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -156,6 +156,7 @@ static int KnownAssignedXidsGet(TransactionId *xarray, TransactionId xmax); static int KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin, TransactionId xmax); +static int KnownAssignedXidsGetOldestXmin(void); static void KnownAssignedXidsDisplay(int trace_level); /* @@ -1112,6 +1113,18 @@ GetOldestXmin(bool allDbs, bool ignoreVacuum) } } + if (RecoveryInProgress()) + { + /* + * Check to see whether KnownAssignedXids contains an xid value + * older than the main procarray. 
+ */ + TransactionId kaxmin = KnownAssignedXidsGetOldestXmin(); + if (TransactionIdIsNormal(kaxmin) && + TransactionIdPrecedes(kaxmin, result)) +result = kaxmin; + } + LWLockRelease(ProcArrayLock); /* @@ -3015,6 +3028,33 @@ KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin, return count; } +static int +KnownAssignedXidsGetOldestXmin(void) +{ + /* use volatile pointer to prevent code rearrangement */ + volatile ProcArrayStruct *pArray = procArray; + int head, +tail; + int i; + + /* + * Fetch head just once, since it may change while we loop. + */ + SpinLockAcquire(&pArray->known_assigned_xids_lck); + tail = pArray->tailKnownAssignedXids; + head = pArray->headKnownAssignedXids; + SpinLockRelease(&pArray->known_assigned_xids_lck); + + for (i = tail; i < head; i++) + { + /* Skip any gaps in the array */ + if (KnownAssignedXidsValid[i]) + return KnownAssignedXids[i]; + } + + return InvalidTransactionId; +} + /* * Display KnownAssignedXids to provide debug trail *
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
(Sorry, need to disable Ctrl-Return, which quite often sends mails earlier than I really want.. continuing my mail) On 08/27/2010 10:46 PM, Robert Haas wrote: Yeah, probably. I think designing something that works efficiently over a network is a somewhat different problem than designing something that works on an individual node, and we probably shouldn't let the designs influence each other too much. Agreed. Thus I've left out any kind of congestion avoidance stuff from imessages so far. There's no padding or sophisticated allocation needed. You just need a pointer to the last byte read (P1), the last byte allowed to be read (P2), and the last byte allocated (P3). Writers take a spinlock, advance P3, release the spinlock, write the message, take the spinlock, advance P2, release the spinlock, and signal the reader. That would block parallel writers (i.e. only one process can write to the queue at any time). I feel like there's probably some variant of this idea that works around that problem. The problem is that when a worker finishes writing a message, he needs to know whether to advance P2 only over his own message or also over some subsequent message that has been fully written in the meantime. I don't know exactly how to solve that problem off the top of my head, but it seems like it might be possible. I've tried pretty much that before. And failed. Because the allocation-order (i.e. the time the message gets created in preparation for writing to it) isn't necessarily the same as the sending-order (i.e. when the process has finished writing and decides to send the message). To satisfy the FIFO property WRT the sending order, you need to decouple allocation from the ordering (i.e. queuing logic). (And yes, it took me a while to figure out what was wrong in Postgres-R, before I even noticed that design bug). Readers take the spinlock, read P1 and P2, release the spinlock, read the data, take the spinlock, advance P1, and release the spinlock.
It would require copying data in case a process only needs to forward the message. That's a quick pointer dequeue and enqueue exercise ATM. If we need to do that, that's a compelling argument for having a single messaging area rather than one per backend. Absolutely, yes. But I'm not sure I see why we would need that sort of capability. Why wouldn't you just arrange for the sender to deliver the message directly to the final recipient? A process can read and even change the data of the message before forwarding it. Something the coordinator in Postgres-R does sometimes. (As it is the interface to the GCS and thus to the rest of the nodes in the cluster). For parallel querying (on a single node) that's probably less important a feature. So, they know in advance how large the message will be but not what the contents will be? What are they doing? Filling the message until it's (mostly) full and then continuing with the next one. At least that's how the streaming approach on top of imessages works. But yes, it's somewhat annoying to have to know the message size in advance. I didn't implement realloc so far. Nor can I think of any other solution. Note that separation of allocation and queue ordering is required anyway for the above reasons. Well, the fact that something is commonly used doesn't mean it's right for us. Tabula rasa, we might design the whole system differently, but changing it now is not to be undertaken lightly. Hopefully the above comments shed some light on my concerns. In short, (1) I don't want to preallocate a big chunk of memory we might not use, Isn't that exactly what we do now for lots of sub-systems, and what I'd like to improve (i.e. reduce to a single big chunk)? (2) I fear reducing the overall robustness of the system, and Well, that applies to pretty much every new feature you add. (3) I'm uncertain what other systems would be able to leverage a dynamic allocator of the sort you propose.
Okay, that's up to me to show evidence (or at least a PoC). Regards Markus Wanner
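A minimal single-writer sketch of the P1/P2/P3 ring discussed in this exchange may help make the disagreement concrete: with one writer, publishing (advancing P2) is trivial, while the multi-writer case is exactly the problem Markus describes, since a finished writer must not advance P2 past a slower predecessor's unfinished message. All names are invented for the example, and the spinlocks are elided.

```c
#include <stddef.h>

#define RING_SIZE 256

/* P1 = last byte consumed, P2 = last byte readable, P3 = last byte
 * allocated.  The counters grow monotonically; buf indexing is modulo. */
typedef struct MsgRing
{
    size_t p1;                      /* consumed by the reader */
    size_t p2;                      /* readable by the reader */
    size_t p3;                      /* allocated to writers   */
    char   buf[RING_SIZE];
} MsgRing;

/* writer: reserve (advance p3), copy the payload, publish (advance p2) */
int ring_write(MsgRing *r, const char *msg, size_t len)
{
    if (r->p3 + len - r->p1 > RING_SIZE)
        return 0;                   /* no room without clobbering unread data */

    size_t start = r->p3;
    r->p3 += len;                   /* reserve: spinlock-guarded in real life */

    for (size_t i = 0; i < len; i++)
        r->buf[(start + i) % RING_SIZE] = msg[i];

    r->p2 = r->p3;                  /* publish: only safe with ONE writer */
    return 1;
}

/* reader: consume everything between p1 and p2 */
size_t ring_read(MsgRing *r, char *out, size_t max)
{
    size_t n = 0;
    while (r->p1 < r->p2 && n < max)
    {
        out[n++] = r->buf[r->p1 % RING_SIZE];
        r->p1++;
    }
    return n;
}
```

With concurrent writers, the `r->p2 = r->p3` line is where allocation order and sending order collide: a writer that finishes early cannot know whether the region between P2 and its own message is fully written, which is the decoupling problem described above.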
Re: [HACKERS] bg worker: patch 1 of 6 - permanent process
Hi, On 08/27/2010 10:46 PM, Robert Haas wrote: What other subsystems are you imagining servicing with a dynamic allocator? If there were a big demand for this functionality, we probably would have been forced to implement it already, but that's not the case. We've already discussed the fact that there are massive problems with using it for something like shared_buffers, which is by far the largest consumer of shared memory. Understood. I certainly plan to look into that for a better understanding of the problems those pose for dynamically allocated memory. I think it would be great if we could bring some more flexibility to our memory management. There are really two layers of problems here. Full ACK. One is resizing the segment itself, and one is resizing structures within the segment. As far as I can tell, there is no portable API that can be used to resize the shm itself. For so long as that remains the case, I am of the opinion that any meaningful resizing of the objects within the shm is basically unworkable. So we need to solve that problem first. Why should resizing of the objects within the shmem be unworkable? Don't my patches prove the exact opposite? Being able to resize "objects" within the shm requires some kind of underlying dynamic allocation. And I'd rather be in control of that allocator than have to deal with two dozen different implementations on different OSes and their libraries. There are a couple of possible solutions, which have been discussed here in the past. I currently don't have much interest in dynamic resizing. Being able to resize the overall amount of shared memory on the fly would be nice, sure. But the total amount of RAM in a server changes rather infrequently. Being able to use what's available more efficiently is what I'm interested in. That doesn't need any kind of additional or different OS level support. It's just a matter of making better use of what's available - within Postgres itself.
Next, we have to think about how we're going to resize data structures within this expandable shm. Okay, that's where I'm getting interested. Many of these structures are not things that we can easily move without bringing the system to a halt. For example, it's difficult to see how you could change the base address of shared buffers without ceasing all system activity, at which point there's not really much advantage over just forcing a restart. Similarly with LWLocks or the ProcArray. I guess that's what Bruce wanted to point out by saying our data structures are mostly "continuous". I.e. not dynamic lists or hash tables, but plain simple arrays. Maybe that's a subjective impression, but I seem to hear complaints about their fixed size and inflexibility quite often. Try to imagine the flexibility that dynamic lists could give us. And if you can't move them, then how will you grow them if (as will likely be the case) there's something immediately following them in memory? One possible solution is to divide up these data structures into "slabs". For example, we might imagine allocating shared_buffers in 1GB chunks. Why 1GB and do yet another layer of dynamic allocation within that? The buffers are (by default) 8K, so allocate in chunks of 8K. Or a tiny bit more for all of the book-keeping stuff. To make this work, we'd need to change the memory layout so that each chunk would include all of the miscellaneous stuff that we need to do bookkeeping for that chunk, such as the LWLocks and buffer descriptors. That doesn't seem completely impossible, but there would be some performance penalty, because you could no longer index into shared buffers from a single base offset. AFAICT we currently have four fixed-size blocks to manage shared buffers: the buffer blocks themselves, the buffer descriptors, the strategy status (for the freelist) and the buffer lookup table.
It's not obvious to me how these data structures should perform better than a dynamically allocated layout. One could rather argue that combining (some of) the bookkeeping stuff with the data itself would lead to better locality and thus perform better. Instead, you'd need to determine which chunk contains the buffer you want, look up the base address for that chunk, and then index into the chunk. Maybe that overhead wouldn't be significant (or maybe it would); at any rate, it's not completely free. There's also the problem of handling the partial chunk at the end, especially if that happens to be the only chunk. This sounds way too complicated, yes. Use 8K chunks and most of the problems vanish. I think the problems for other arrays are similar, or more severe. I can't see, for example, how you could resize the ProcArray using this approach. Try not to think in terms of resizing, but dynamic allocation. I.e. being able to resize ProcArray (and thus being able to alter max_connections on the fly) would take a lot more work.
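A tiny C sketch of the chunked addressing being debated makes the cost comparison concrete: finding a buffer in the chunked layout costs one extra division and modulo relative to the single-base-pointer layout. The constants here are demo-sized and the names invented; in the discussion the real values would be 8K pages, and either 1GB chunks (Robert) or per-buffer chunks (Markus).

```c
#include <stdlib.h>

/* demo-sized constants, not the real BLCKSZ / chunk geometry */
#define BLCKSZ          128
#define BUFS_PER_CHUNK  16

typedef struct BufferChunk
{
    char *blocks;                   /* BUFS_PER_CHUNK * BLCKSZ bytes */
    /* per-chunk bookkeeping (descriptors, LWLocks) would live here too */
} BufferChunk;

/* single-base-pointer layout: one multiply from a global base */
char *buf_addr_flat(char *base, size_t n)
{
    return base + n * BLCKSZ;
}

/* chunked layout: find the chunk first, then index within it */
char *buf_addr_chunked(BufferChunk *chunks, size_t n)
{
    BufferChunk *c = &chunks[n / BUFS_PER_CHUNK];

    return c->blocks + (n % BUFS_PER_CHUNK) * BLCKSZ;
}
```

The extra indirection is the "not completely free" overhead mentioned above; in exchange, each chunk can be allocated (or in principle freed) independently, which is what makes the layout growable without moving existing buffers.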
Re: [HACKERS] upcoming wraps
On sön, 2010-08-29 at 23:14 -0400, Tom Lane wrote: > > And what about 9.1alpha1? > > Peter muttered something about doing that this week. The major blocker is preparing the release notes. If someone has time for that ...
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On Mon, 2010-08-30 at 09:59 +0300, Heikki Linnakangas wrote: > On 27/08/10 20:17, Fujii Masao wrote: > > Yes. StartupXLOG calls that before bgwriter is invoked. That is, we can > > ensure that StartupSUBTRANS has always been done before bgwriter > > performs a restartpoint. > > Hmm, the comment in CreateCheckpoint() isn't totally accurate either: > > > * Truncate pg_subtrans if possible. We can throw away all data before > > * the oldest XMIN of any running transaction. No future transaction > > will > > * attempt to reference any pg_subtrans entry older than that (see Asserts > > * in subtrans.c). During recovery, though, we mustn't do this because > > * StartupSUBTRANS hasn't been called yet. Yep. > because in Hot Standby mode, StartupSUBTRANS has been called already. We > could truncate pg_subtrans there too when hot standby is enabled. But > this is only about the startup checkpoint at the end of recovery, so I'm > inclined to not change that, not right now just before release anyway, > just in case we're missing something... > > However, is it safe to use GetOldestXMin() during recovery? Or to put it > another way, is GetOldestXMin() functioning correctly during hot standby? > It only scans through the ProcArray, but not the known-assigned xids > array. That seems like an oversight that needs to be fixed. Yes, that's correct. Otherwise the patch is fine. I'm working on this now and will commit something shortly. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] pg_subtrans keeps bloating up in the standby
On 27/08/10 20:17, Fujii Masao wrote: Yes. StartupXLOG calls that before bgwriter is invoked. That is, we can ensure that StartupSUBTRANS has always been done before bgwriter performs a restartpoint. Hmm, the comment in CreateCheckpoint() isn't totally accurate either: * Truncate pg_subtrans if possible. We can throw away all data before * the oldest XMIN of any running transaction. No future transaction will * attempt to reference any pg_subtrans entry older than that (see Asserts * in subtrans.c). During recovery, though, we mustn't do this because * StartupSUBTRANS hasn't been called yet. because in Hot Standby mode, StartupSUBTRANS has been called already. We could truncate pg_subtrans there too when hot standby is enabled. But this is only about the startup checkpoint at the end of recovery, so I'm inclined to not change that, not right now just before release anyway, just in case we're missing something... However, is it safe to use GetOldestXMin() during recovery? Or to put it another way, is GetOldestXMin() functioning correctly during hot standby? It only scans through the ProcArray, but not the known-assigned xids array. That seems like an oversight that needs to be fixed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com