Re: [HACKERS] per-column FDW options, v5

2011-08-08 Thread Shigeru Hanada
Sorry, I missed sending a copy to the list, so I've quoted the off-list discussion.

 On Aug 5, 2011, at 7:59 PM, Shigeru Hanada shigeru.han...@gmail.com
wrote:

 2011/8/6 Robert Haas robertmh...@gmail.com:
 Done.

 Thanks!

 Incidentally, I notice that if you do:

 \d some_foreign_table

 ...the table-level options are not displayed.  We probably ought to do
 something about that...

 Currently table-level options are shown in the result of the \det+ command
 (only verbose mode), in the same style as FDWs and foreign servers.

 But \d is more popular for describing tables, so moving table-level
 options from \det+ to \d might be better.  Thoughts?

(2011/08/06 9:26), Robert Haas wrote:
 I'd show it both places.

After taking a look at describe.c, I think several styles are applicable to
FDW options in the result of the \d command.

(1) simple array literal style
Show table-level FDW options like other FDW options.  It is simply the
result of array_out(); each option is shown as key=value, quoted if
necessary and delimited by ','.  The whole line is surrounded by { and }.
If an element includes any character which needs to be escaped, the
element is quoted with double quotes.

Ex)
FDW Options: {"delimiter=,","quote=\""}
# delimiter is a comma, and quote is a double-quote

(2) reloptions style
Show FDW options like the reloptions in \d+ output.  Each option is shown
as key=value without quoting.  Some special characters might make it a
little illegible.

Ex)
FDW Options: delimiter=,, quote="
# delimiter is a comma, and quote is a double-quote

(3) OPTIONS clause style
Show FDW options as they would appear in an OPTIONS clause.  Each option
is shown as key 'value', delimited with ','.

Ex)
FDW Options: delimiter ',', quote 
# delimiter is a comma, and quote is a single-quote

(1) can be implemented with minimal change, and it also keeps the
behavior consistent with other existing FDW objects.  But IMHO (3) is the
most readable, though it breaks backward compatibility regarding the
format of FDW options shown in the result of \d* commands.  Thoughts?

BTW, I found a wrong description of per-table FDW options; the psql
documentation says that \d+ shows table-level FDW options, but actually it
doesn't.  I'll post this issue in a new thread.

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] force_not_null option support for file_fdw

2011-08-08 Thread Shigeru Hanada
Hi,

I propose to support the force_not_null option for file_fdw too.

In the 9.1 development cycle, file_fdw was implemented on top of exported
COPY FROM routines, but the force_not_null option has not been
supported yet.

Originally, in COPY FROM, force_not_null is specified as a list of
columns which are not matched against the null string.  Now that per-column
FDW options are available, I implemented force_not_null as a boolean
value for each column, to avoid parsing a column list in file_fdw.

True means that the column is not matched against the null string; it's
equivalent to specifying the column's name in the force_not_null option of
COPY FROM.  The default value is false.
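
The body of get_force_not_null() is cut off in the quoted patch below; as a
rough sketch (not the attached patch itself), such a helper could build the
column-name list COPY expects from the new per-column options API like this:

    /*
     * Sketch only: collect names of columns whose per-column
     * force_not_null option is true, in the list-of-names form that
     * COPY's force_not_null expects.
     */
    static List *
    get_force_not_null(Oid relid)
    {
        List       *columns = NIL;
        Relation    rel = heap_open(relid, AccessShareLock);
        TupleDesc   tupdesc = RelationGetDescr(rel);
        AttrNumber  attnum;

        for (attnum = 1; attnum <= tupdesc->natts; attnum++)
        {
            Form_pg_attribute attr = tupdesc->attrs[attnum - 1];
            ListCell   *lc;

            if (attr->attisdropped)
                continue;

            foreach(lc, GetForeignColumnOptions(relid, attnum))
            {
                DefElem    *def = (DefElem *) lfirst(lc);

                if (strcmp(def->defname, "force_not_null") == 0 &&
                    defGetBoolean(def))
                    columns = lappend(columns,
                                      makeString(pstrdup(NameStr(attr->attname))));
            }
        }
        heap_close(rel, AccessShareLock);
        return columns;
    }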

The patch includes changes to the code, documentation and regression tests.
Any comments/questions are welcome.

Regards,
-- 
Shigeru Hanada
diff --git a/contrib/file_fdw/data/text.csv b/contrib/file_fdw/data/text.csv
index ...bd4c327 .
*** a/contrib/file_fdw/data/text.csv
--- b/contrib/file_fdw/data/text.csv
***
*** 0 
--- 1,4 
+ 123,123
+ abc,abc
+ NULL,NULL
+ ABC,ABC
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 224e74f..e846176 100644
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 23,30 
--- 23,32 
  #include "foreign/fdwapi.h"
  #include "foreign/foreign.h"
  #include "miscadmin.h"
+ #include "nodes/makefuncs.h"
  #include "optimizer/cost.h"
  #include "utils/rel.h"
+ #include "utils/syscache.h"
  
  PG_MODULE_MAGIC;
  
*** static struct FileFdwOption valid_option
*** 57,72 
	{"escape", ForeignTableRelationId},
	{"null", ForeignTableRelationId},
	{"encoding", ForeignTableRelationId},
  
/*
 * force_quote is not supported by file_fdw because it's for COPY TO.
 */
  
-   /*
-* force_not_null is not supported by file_fdw.  It would need a parser
-* for list of columns, not to mention a way to check the column list
-* against the table.
-*/
  
/* Sentinel */
{NULL, InvalidOid}
--- 59,70 
	{"escape", ForeignTableRelationId},
	{"null", ForeignTableRelationId},
	{"encoding", ForeignTableRelationId},
+	{"force_not_null", AttributeRelationId},	/* specified as boolean value */
  
/*
 * force_quote is not supported by file_fdw because it's for COPY TO.
 */
  
  
/* Sentinel */
{NULL, InvalidOid}
*** static void fileGetOptions(Oid foreignta
*** 112,117 
--- 110,116 
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
+ static List * get_force_not_null(Oid relid);
  
  
  /*
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 145,150 
--- 144,150 
List   *options_list = untransformRelOptions(PG_GETARG_DATUM(0));
Oid catalog = PG_GETARG_OID(1);
char   *filename = NULL;
+   char   *force_not_null = NULL;
List   *other_options = NIL;
ListCell   *cell;
  
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 198,204 
 buf.data)));
}
  
!	/* Separate out filename, since ProcessCopyOptions won't allow it */
	if (strcmp(def->defname, "filename") == 0)
{
if (filename)
--- 198,207 
 buf.data)));
}
  
!	/*
!	 * Separate out filename and force_not_null, since ProcessCopyOptions
!	 * won't allow them.
!	 */
	if (strcmp(def->defname, "filename") == 0)
{
if (filename)
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 207,212 
--- 210,229 
				 errmsg("conflicting or redundant options")));
filename = defGetString(def);
}
+	else if (strcmp(def->defname, "force_not_null") == 0)
+   {
+		if (force_not_null)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("conflicting or redundant options")));
+
+		force_not_null = defGetString(def);
+		if (strcmp(force_not_null, "true") != 0 &&
+			strcmp(force_not_null, "false") != 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("force_not_null must be true or false")));
+   }
else
other_options = 

Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Kohei KaiGai
2011/8/7 Tom Lane t...@sss.pgh.pa.us:
 Kohei KaiGai kai...@kaigai.gr.jp writes:
 I'm implementing this code according to the suggestion.
 However, I'm not sure whether it is a really portable way (at least, GCC
 accepts it),
 and whether the interface is simpler than what Robert suggested at first.

 #define get_object_property_attnum_name(objtype)                        \
     ({  AttrNumber temp;                                            \
         get_object_property((objtype), NULL, NULL, NULL, NULL,          \
                             &temp, NULL, NULL);                     \
         temp; })

 Blocks within expressions are a gcc-ism and will fail on any other
 compiler, so you can't do it that way.

Thanks for your suggestion.
So, it seems to me the interface should return a pointer to the specified
entry of the array, rather than the above approach.

E.g., the above macro could probably be rewritten as follows:
  #define get_object_property_attnum_name(objtype) \
  (get_object_property(objtype)->attnum_name)
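
For illustration, here is a rough sketch of what the table-driven version
could look like (the names here are hypothetical, following the thread):

    typedef struct
    {
        ObjectType  objtype;
        Oid         class_oid;          /* catalog the object lives in */
        AttrNumber  attnum_name;        /* attnum of its name column */
        AttrNumber  attnum_namespace;   /* attnum of its namespace column */
    } ObjectPropertyType;

    static const ObjectPropertyType ObjectProperty[] =
    {
        {OBJECT_TABLE, RelationRelationId,
         Anum_pg_class_relname, Anum_pg_class_relnamespace},
        /* ... one entry per supported object type ... */
    };

    static const ObjectPropertyType *
    get_object_property(ObjectType objtype)
    {
        int         i;

        for (i = 0; i < lengthof(ObjectProperty); i++)
        {
            if (ObjectProperty[i].objtype == objtype)
                return &ObjectProperty[i];
        }
        elog(ERROR, "unrecognized object type: %d", (int) objtype);
        return NULL;            /* keep compiler quiet */
    }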

-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] per-column FDW options, v5

2011-08-08 Thread Shigeru Hanada
(2011/07/29 17:37), Shigeru Hanada wrote:
 I also attached a rebased version of force_not_null patch, which adds
 force_not_null option support to file_fdw.  This is a use case of
 per-column FDW option.

[just for redirection]

Robert has committed only per_column_option patch. So I posted
force_not_null patch in a new thread, and added it to CF 2011-09 as an
independent new item.

https://commitfest.postgresql.org/action/patch_view?id=615

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plperl crash with Debian 6 (64 bit), pl/perlu, libwww and https

2011-08-08 Thread Alex Hunsaker
On Sun, Aug 7, 2011 at 17:06, Tim Bunce tim.bu...@pobox.com wrote:

 On Sat, Aug 06, 2011 at 12:37:28PM -0600, Alex Hunsaker wrote:
 ...
 Find attached a version that does the equivalent of local %SIG for
 each pl/perl(u) call.

 +     gv = gv_fetchpv("SIG", 0, SVt_PVHV);
 +     save_hash(gv);                  /* local %SIG */

 ... [ local %SIG doesn't work ] The %SIG does become empty but the OS
 level handlers, even those installed by perl, *aren't changed*:

Looks like I trusted too much in $SIG{'ALRM'} being undef after it had been
set in a different scope :-( Thanks for pointing this out.

 That sure seems like a bug (I'll check with the perl5-porters list).

Well, even if it was deemed a bug, it doesn't do us any good.

 Localizing an individual element of %SIG works fine.
 In C that's something like this (untested):

    hv = gv_fetchpv("SIG", 0, SVt_PVHV);
    keysv = ...SV containing "ALRM"...
    he = hv_fetch_ent(hv, keysv, 0, 0);
    if (he) {  /* arrange to restore existing elem */
        save_helem_flags(hv, keysv, HeVAL(he), SAVEf_SETMAGIC);
    }
    else {     /* arrange to delete a new elem */
        SAVEHDELETE(hv, keysv);
    }

I played with this a bit... and found yes, it locals them, but no, it
does not fix the reported problem. After playing with things a bit
more I found even local $SIG{'ALRM'} = ...; alarm(1); still results
in postgres crashing. To wit, local does squat. AFAICT it just resets
the signal handler back to the default with SIG_DFL. (Which in
hindsight I don't know what else I expected it to do...)

So I think for this to be robust we would have to detect what signals
they set and then reset those back to what postgres wants. Doable, but
is it worth it? Anyone else have any bright ideas?

Find below my test case and attached a patch that locals individual
%SIG elements the way mentioned above.

=> set statement_timeout to '5s';
SET

=> create or replace function test_alarm() returns void as $$ local
$SIG{'ALRM'} = sub { warn "alarm"; }; alarm(1); sleep 2; $$ language
plperlu;
CREATE FUNCTION

=> select test_alarm();
WARNING:  alarm at line 1.
CONTEXT:  PL/Perl function test_alarm
 test_alarm
------------
 
(1 row)

=> select pg_sleep(6);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Server Log:
WARNING:  alarm at line 1.
CONTEXT:  PL/Perl function test_alarm
LOG:  server process (PID 32659) was terminated by signal 14: Alarm clock
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
FATAL:  the database system is in recovery mode
*** a/src/pl/plperl/expected/plperl.out
--- b/src/pl/plperl/expected/plperl.out
***
*** 639,641  CONTEXT:  PL/Perl anonymous code block
--- 639,643 
  DO $do$ use warnings FATAL => qw(void) ; my @y; my $x = sort @y; 1; $do$ LANGUAGE plperl;
  ERROR:  Useless use of sort in scalar context at line 1.
  CONTEXT:  PL/Perl anonymous code block
+ DO $do$ die "SIG ALRM is set: $SIG{'ALRM'}" if($SIG{'ALRM'}); $SIG{'ALRM'} = sub { print "alarm!\n"}; $do$ LANGUAGE plperl;
+ DO $do$ die "SIG ALRM is set: $SIG{'ALRM'}" if($SIG{'ALRM'}); $do$ LANGUAGE plperl;
*** a/src/pl/plperl/plperl.c
--- b/src/pl/plperl/plperl.c
***
*** 268,273  static void plperl_inline_callback(void *arg);
--- 268,275 
  static char *strip_trailing_ws(const char *msg);
  static OP  *pp_require_safe(pTHX);
  static void activate_interpreter(plperl_interp_desc *interp_desc);
+ static void local_sigs(void);
+ static void local_sig(HV *hv, SV *tmpsv, const char *signame);
  
  #ifdef WIN32
  static char *setlocale_perl(int category, char *locale);
***
*** 1901,1906  plperl_call_perl_func(plperl_proc_desc *desc, FunctionCallInfo fcinfo)
--- 1903,1910 
  	ENTER;
  	SAVETMPS;
  
+ 	local_sigs();
+ 
  	PUSHMARK(SP);
  	EXTEND(sp, desc-nargs);
  
***
*** 1968,1973  plperl_call_perl_func(plperl_proc_desc *desc, FunctionCallInfo fcinfo)
--- 1972,2028 
  	return retval;
  }
  
+ /*
+  * local() all of our sig handlers: some modules like LWP like to set an
+  * alarm sig handler for things like network timeouts, and this can cause
+  * bad stuff to happen (not to mention what happens if someone sets USR1).
+  *
+  * For now we just local() them all, so they should get reset back to what
+  * postgres expects when the pl function is done.
+  */
+ static void
+ local_sigs(void)
+ {
+ 	HV	*hv;
+ 	SV	*sv = newSV(9);
+ 	int i;
+ 
+ 	hv = get_hv("SIG", 0);
+ 	if (!hv)
+ 		elog(ERROR, "couldn't fetch %%SIG");
+ 
+ 	/*
+ 	 * char 

Re: [HACKERS] plperl crash with Debian 6 (64 bit), pl/perlu, libwww and https

2011-08-08 Thread Tim Bunce
On Mon, Aug 08, 2011 at 01:23:08AM -0600, Alex Hunsaker wrote:
 On Sun, Aug 7, 2011 at 17:06, Tim Bunce tim.bu...@pobox.com wrote:
 
  Localizing an individual element of %SIG works fine.
  In C that's something like this (untested):
 
     hv = gv_fetchpv("SIG", 0, SVt_PVHV);
     keysv = ...SV containing "ALRM"...
     he = hv_fetch_ent(hv, keysv, 0, 0);
     if (he) {  /* arrange to restore existing elem */
         save_helem_flags(hv, keysv, HeVAL(he), SAVEf_SETMAGIC);
     }
     else {     /* arrange to delete a new elem */
         SAVEHDELETE(hv, keysv);
     }
 
 I played with this a bit... and found yes, it locals them, but no, it
 does not fix the reported problem. After playing with things a bit
 more I found even local $SIG{'ALRM'} = ...; alarm(1); still results
 in postgres crashing. To wit, local does squat. AFAICT it just resets
 the signal handler back to the default with SIG_DFL. (Which in
 hindsight I don't know what else I expected it to do...)

Ah, yes. Hindsight is great. I should have spotted that. Sorry.

 So I think for this to be robust we would have to detect what signals
 they set and then reset those back to what postgres wants. Doable, but
 is it worth it? Anyone else have any bright ideas?

I'm only considering ALRM. At least that's the only one that seems worth
offering some limited support for. The others fall under "don't do that".

After giving it some more thought it seems reasonable to simply force the
SIGALRM handler back to postgres when a plperlu function returns:

pqsignal(SIGALRM, handle_sig_alarm);
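
A sketch of the same idea generalized to a handful of signals (the signal
list and the wrapper are my illustration, not part of the proposal above):

    #include <signal.h>

    static const int guarded_signals[] = {SIGALRM, SIGUSR1, SIGUSR2};
    #define N_GUARDED (sizeof(guarded_signals) / sizeof(guarded_signals[0]))

    static void
    call_perl_guarded(void (*perl_call) (void))
    {
        struct sigaction saved[N_GUARDED];
        int         i;

        /* snapshot the OS-level handlers before calling into Perl */
        for (i = 0; i < N_GUARDED; i++)
            sigaction(guarded_signals[i], NULL, &saved[i]);

        perl_call();            /* may install its own handlers */

        /* force the handlers back to what postgres expects */
        for (i = 0; i < N_GUARDED; i++)
            sigaction(guarded_signals[i], &saved[i], NULL);
    }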

Tim.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transient plans versus the SPI API

2011-08-08 Thread Anssi Kääriäinen

On 08/07/2011 12:25 PM, Hannu Krosing wrote:

On Sun, 2011-08-07 at 11:15 +0200, Hannu Krosing wrote:

On Wed, 2011-08-03 at 15:19 -0400, Tom Lane wrote:

Hm, you mean reverse-engineering the parameterization of the query?

Yes, basically re-generate the query after (or while) parsing, replacing
constants and arguments with another set of generated arguments and
printing the list of these arguments at the end. It may be easiest to do
this in parallel with parsing.


Interesting thought, but I really don't see a way to make it practical.

Another place where this could be really useful is logging & monitoring

If there were an option to log the above queries as

SELECT a, b FROM foo WHERE c = $1, (123)
SELECT a, b FROM foo WHERE c = $1, (97)
SELECT a, b FROM foo WHERE c = $1, (236)

The main monitoring use case would be pg_stat_statements,
http://developer.postgresql.org/pgdocs/postgres/pgstatstatements.html
which is currently pretty useless for queries without parameters


I was trying to implement something similar for pgpool-II. The user
could configure queries for which cached plans are wanted. The
configuration would have been a file containing lines in the format
"SELECT * FROM foo WHERE id = ?". I did not get anything implemented, as there
were some problems. The problems were mainly with DEALLOCATE ALL called
without pgpool-II knowing it, issues with search_path and the amount of
work needed to implement parse tree matching.


It would be interesting if pg_stat_statements would be globally 
available with queries using generic arguments. First, there would be an 
obvious heuristic for when to cache the plan: If the average runtime of 
the query is much larger than the average planning time, there is no 
point in caching the plan. This would also give one option for cache hit 
estimation. The hit_percent is directly available. On the other hand 
pg_stat_statements could easily become a choke-point.


I would love to work on this, but I lack the needed skills. Maybe I 
could take another try for writing a proof-of-concept parse tree 
transformer and matcher, but I doubt I can produce anything useful.


 - Anssi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-08-08 Thread Heikki Linnakangas

On 07.08.2011 22:28, Alexander Korotkov wrote:

Here is the latest version of the patch. Here is the list of the most
significant changes in comparison with your version of the patch:
1) Reference counting of path items was added. It should help against
possible accumulation of path items.


Ok.


2) Neighbor relocation was added.


Ok. I think we're going to need some sort of a heuristic on when to 
enable neighbor relocation. If I remember the performance tests 
correctly, it improves the quality of the resulting index, but incurs a 
significant CPU overhead.


Actually, looking at the performance numbers on the wiki page again 
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it 
looks like neighbor relocation doesn't help very much with the index 
quality - sometimes it even results in a slightly worse index. Based on 
those results, shouldn't we just remove it? Or is there some other data 
set where it helps significantly?



3) Subtree prefetching was added.


I'm inclined to leave out the prefetching code for now. Unless you have 
some performance numbers that show that it's critical for the overall 
performance. But I don't think that was the case, it's just an 
additional optimization for servers with big RAID arrays. So, please 
separate that into an add-on patch. It needs to be performance tested and 
reviewed separately.



4) The final emptying algorithm was reverted to the original one. My
experiments show that the typical number of passes in final emptying in
your version of the patch is about 5. That may be significant in itself.
Also, I have no estimate of the number of passes and no guarantee that it
will not be high in some corner cases. I.e. I prefer the more predictable
single-pass algorithm even though it's a little more complex.


I was trying to get rid of that complexity during index build. Some 
extra code in the final pass would be easier to understand than extra 
work that needs to be done through the index build. It's not a huge 
amount of code, but still.


I'm not worried about the extra CPU overhead of scanning the data 
structures at the final pass. I guess in my patch you had to do extra 
I/O as well, because the buffers were not emptied in strict top-down 
order, so let's avoid that. How about:


Track all buffers in the lists, not only those that are non-empty. Add 
the buffer to the right list at getNodeBuffer(). That way in the final 
stage, you need to scan through all buffers instead of just the 
non-empty ones. But the overhead of that should be minimal in 
practice, scanning some in-memory data structures is pretty cheap 
compared to building an index. That way you wouldn't need to maintain 
the lists during the index build, except for adding each buffer to 
correct lists in getNodeBuffer().


BTW, please use List for the linked lists. No need to re-implement the 
wheel.
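
To illustrate with the existing List API (a sketch; GISTNodeBuffer,
levelLists and empty_buffer are hypothetical names following the patch):

    #include "nodes/pg_list.h"

    /* one List of buffers per level; sized to the tree depth elsewhere */
    static List **levelLists;

    static void empty_buffer(GISTNodeBuffer *buf);      /* hypothetical */

    /* in getNodeBuffer(): register every buffer, empty or not */
    static void
    register_buffer(int level, GISTNodeBuffer *nodeBuffer)
    {
        /* lappend handles the initial NIL case itself */
        levelLists[level] = lappend(levelLists[level], nodeBuffer);
    }

    /* final emptying: strict top-down, one scan over each level's list */
    static void
    final_emptying(int maxLevel)
    {
        int         level;

        for (level = maxLevel; level >= 0; level--)
        {
            ListCell   *lc;

            foreach(lc, levelLists[level])
                empty_buffer((GISTNodeBuffer *) lfirst(lc));
        }
    }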



5) Unloading even the last page of a node buffer from main memory to disk.
Imagine that with levelstep = 1 each inner node has a buffer. It
seems to me that keeping one page of each buffer in memory may be memory
consuming.

Open items I see at this moment:
1) I dislike my method of switching to the buffering build because it's
based on very unproven assumptions. And I haven't found more reliable
assumptions in scientific papers so far. I would like to replace it with
something much simpler. For example, switching to the buffering build when
the regular build actually starts to produce a lot of IO. To implement this
approach I need to somehow detect actual IO (not just a buffer read, but a
miss of the OS cache).


Yeah, that's a surprisingly hard problem. I don't much like the method 
used in the patch either.



2) I'm worried about the possible size of nodeBuffersTab and path items. If
we imagine an extremely large tree with levelstep = 1, the size of these
data structures will be significant. And it's hard to predict that size
without knowing the tree size.


I'm not very worried about that in practice. If you have a very large 
index, you presumably have a fair amount of memory too. Otherwise the 
machine is horrendously underpowered to build or do anything useful with 
the index anyway. Nevertheless it would be nice to have some figures on 
that. If you have, say, an index of 1 TB in size, how much memory will 
building the index need?


Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious 
whitespace in the patch.
* How about renaming GISTLoadedPartItem to something like 
GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the 
normal insertion code. The "loaded part" nomenclature is obsolete, as 
the patch doesn't explicitly load parts of the tree into memory anymore. 
Think about the names of other structs, variables and functions too, 
GISTLoadedPartItem just caught my eye first but there's probably others 
that could have better names.

* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler 

Re: [HACKERS] Transient plans versus the SPI API

2011-08-08 Thread Hannu Krosing
On Mon, 2011-08-08 at 11:39 +0300, Anssi Kääriäinen wrote:
 On 08/07/2011 12:25 PM, Hannu Krosing wrote:
  On Sun, 2011-08-07 at 11:15 +0200, Hannu Krosing wrote:
  On Wed, 2011-08-03 at 15:19 -0400, Tom Lane wrote:
  Hm, you mean reverse-engineering the parameterization of the query?
  Yes, basically re-generate the query after (or while) parsing, replacing
  constants and arguments with another set of generated arguments and
  printing the list of these arguments at the end. It may be easiest to do
  this in parallel with parsing.
 
  Interesting thought, but I really don't see a way to make it practical.
  Another place where this could be really useful is logging & monitoring
 
  If there were an option to log the above queries as
 
  SELECT a, b FROM foo WHERE c = $1, (123)
  SELECT a, b FROM foo WHERE c = $1, (97)
  SELECT a, b FROM foo WHERE c = $1, (236)
  The main monitoring use case would be pg_stat_statements,
  http://developer.postgresql.org/pgdocs/postgres/pgstatstatements.html
  which is currently pretty useless for queries without parameters
 
 I was trying to implement something similar for pgpool-II. The user 
 could configure queries for which cached plans are wanted. The 
 configuration would have been a file containing lines in format SELECT 
 * FROM foo WHERE id = ?. I did not get anything implemented, as there 
 were some problems. The problems were mainly with DEALLOCATE ALL called 
 without pgpool-II knowing it, issues with search_path and the amount of 
 work needed to implement parse tree matching.
 
 It would be interesting if pg_stat_statements would be globally 
 available with queries using generic arguments. First, there would be an 
 obvious heuristic for when to cache the plan: If the average runtime of 
 the query is much larger than the average planning time, there is no 
 point in caching the plan. This would also give one option for cache hit 
 estimation. The hit_percent is directly available. On the other hand 
 pg_stat_statements could easily become a choke-point.
 
 I would love to work on this, but I lack the needed skills. Maybe I 
 could take another try for writing a proof-of-concept parse tree 
 transformer and matcher, but I doubt I can produce anything useful.

That is why I think it is best done in the main parser - it has to parse
and analyse the query anyway and likely knows which constants are
arguments to the query.

If doing it outside the main backend parser, it would be best to still
use the PostgreSQL lex/bison files as much as possible for this.
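
Just to make the idea concrete, here is a deliberately naive, standalone
sketch that replaces integer literals with $n placeholders. A real
implementation would have to work on (or alongside) the parse tree and
handle string literals, quoting, comments, negative numbers, typed
constants and so on:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Replace each integer literal in src with $1, $2, ... */
    static void
    normalize_query(const char *src, char *dst, size_t dstlen)
    {
        int         n = 0;
        size_t      o = 0;

        while (*src && o + 16 < dstlen)
        {
            /* a digit not glued to an identifier starts a literal */
            if (isdigit((unsigned char) *src) &&
                (o == 0 || !isalnum((unsigned char) dst[o - 1])))
            {
                o += snprintf(dst + o, dstlen - o, "$%d", ++n);
                while (isdigit((unsigned char) *src))
                    src++;
            }
            else
                dst[o++] = *src++;
        }
        dst[o] = '\0';
    }

    int
    main(void)
    {
        char        buf[128];

        normalize_query("SELECT a, b FROM foo WHERE c = 123", buf, sizeof(buf));
        puts(buf);      /* SELECT a, b FROM foo WHERE c = $1 */
        return 0;
    }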


-- 
---
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-08-08 Thread Alexander Korotkov
On Mon, Aug 8, 2011 at 1:23 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 2) Neighbor relocation was added.


 Ok. I think we're going to need some sort of a heuristic on when to enable
 neighbor relocation. If I remember the performance tests correctly, it
 improves the quality of the resulting index, but incurs a significant CPU
 overhead.

 Actually, looking at the performance numbers on the wiki page again (
 http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011),
 it looks like neighbor relocation doesn't help very much with the index
 quality - sometimes it even results in a slightly worse index. Based on
 those results, shouldn't we just remove it? Or is there some other data set
 where it helps significantly?

Oh, actually I didn't add any results with neighbor relocation = off. I
would like to rerun some tests with the current version of the patch.


  3) Subtree prefetching was added.


 I'm inclined to leave out the prefetching code for now. Unless you have
 some performance numbers that show that it's critical for the overall
 performance. But I don't think that was the case, it's just an additional
 optimization for servers with big RAID arrays. So, please separate that into
 an add-on patch. It needs to be performance tests and reviewed separately.

I thought that prefetch helps even on separate hard disks by ordering the IOs.


  4) The final emptying algorithm was reverted to the original one. My
 experiments show that the typical number of passes in final emptying in
 your version of the patch is about 5. That may be significant in itself.
 Also, I have no estimate of the number of passes and no guarantee that it
 will not be high in some corner cases. I.e. I prefer the more predictable
 single-pass algorithm even though it's a little more complex.


 I was trying to get rid of that complexity during index build. Some extra
 code in the final pass would be easier to understand than extra work that
 needs to be done through the index build. It's not a huge amount of code,
 but still.

 I'm not worried about the extra CPU overhead of scanning the data
 structures at the final pass. I guess in my patch you had to do extra I/O as
 well, because the buffers were not emptied in strict top-down order, so
 let's avoid that. How about:

 Track all buffers in the lists, not only those that are non-empty. Add the
 buffer to the right list at getNodeBuffer(). That way in the final stage,
 you need to scan through all buffers instead of just the non-empty ones. But
 the overhead of that should be minimal in practice, scanning some
 in-memory data structures is pretty cheap compared to building an index.
 That way you wouldn't need to maintain the lists during the index build,
 except for adding each buffer to correct lists in getNodeBuffer().

 BTW, please use List for the linked lists. No need to re-implement the
 wheel.

Ok.


  5) Unloading even the last page of a node buffer from main memory to disk.
 Imagine that with levelstep = 1 each inner node has a buffer. It
 seems to me that keeping one page of each buffer in memory may be memory
 consuming.

 Open items I see at this moment:
 1) I dislike my method of switching to the buffering build because it's
 based on very unproven assumptions. And I haven't found more reliable
 assumptions in scientific papers so far. I would like to replace it with
 something much simpler. For example, switching to the buffering build when
 the regular build actually starts to produce a lot of IO. To implement this
 approach I need to somehow detect actual IO (not just a buffer read, but a
 miss of the OS cache).


 Yeah, that's a surprisingly hard problem. I don't much like the method used
 in the patch either.

Is it possible to make buffering build a user defined option until we have a
better idea?


  2) I'm worried about the possible size of nodeBuffersTab and path items. If
 we imagine an extremely large tree with levelstep = 1, the size of these
 data structures will be significant. And it's hard to predict that size
 without knowing the tree size.


 I'm not very worried about that in practice. If you have a very large
 index, you presumably have a fair amount of memory too. Otherwise the
 machine is horrendously underpowered to build or do anything useful with the
 index anyway. Nevertheless it would be nice to have some figures on that. If
 you have, say, an index of 1 TB in size, how much memory will building the
 index need?

I think with points it would be about 1 million buffers and about 100-300
megabytes of RAM depending on space utilization. That may be OK, because 1 TB
is a really huge index. But if maintenance_work_mem is low we can run out of
it. Though a low maintenance_work_mem is quite strange for a system with 1 TB
indexes.


 Miscellaneous observations:

 * Please run pgindent over the code, there's a lot of spurious whitespace
 in the patch.
 * How about renaming GISTLoadedPartItem to something like
 

[HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alexander Korotkov
String-formatted relopts were never used before, but I've used them in the
buffering GiST index build patch and encountered the following compiler
warnings:

reloptions.c:259: warning: initializer-string for array of chars is too long
reloptions.c:259: warning: (near initialization for
‘stringRelOpts[0].default_val’)

It is caused by the definition of the default value field of the
relopt_string structure as a 1-length character array. This seems to be a
design flaw in the reloptions.c code. Any thoughts?
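
For reference, the struct in question (abridged from
src/include/access/reloptions.h):

    typedef struct relopt_string
    {
        relopt_gen  gen;
        int         default_len;
        bool        default_isnull;
        validate_string_relopt validate_cb;
        char        default_val[1];     /* variable length, zero-terminated */
    } relopt_string;

default_val[1] is the usual variable-length-struct trick; it works when
add_string_reloption() pallocs the entry at run time, but a static
stringRelOpts[] initializer with a default longer than one byte overflows
the declared array, hence the warning.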

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Transient plans versus the SPI API

2011-08-08 Thread Anssi Kääriäinen

On 08/08/2011 01:07 PM, Hannu Krosing wrote:

That is why I think it is best done in the main parser - it has to parse
and analyse the query anyway and likely knows which constants are
arguments to the query
As far as I understand the problem, the parsing must transform table
references to schema-qualified references. The table_foobar in "SELECT *
FROM table_foobar WHERE id = ?" is not enough to identify a table. Using
(search_path, query_str) as a key is one possibility, but the search_path
is likely to be different for each user, and this could result in the
same query being cached multiple times.


By the way, I checked current Git HEAD and pg_stat_statements seems to
not handle search_path correctly. For pg_stat_statements this is not
critical, but if the raw query string is used as a plan cache key things
will obviously break...


 - Anssi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Robert Haas
On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Any protocol of that sort has *obviously* got a race condition between
 changes of the latch state and changes of the separate flag state, if run
 in a weak-memory-ordering machine.

 At least on the hardware I'm testing, it seems that the key requirement
 for a failure to be seen is that there's a lot of contention for the cache
 line containing the flag, but not for the cache line containing the latch.
 This allows the write to the flag to take longer to propagate to main
 memory than the write to the latch.

This is interesting research, especially because it provides sobering
evidence of how easily a real problem could be overlooked in testing.
If it took several minutes for this to fail on an 8-CPU system, it's
easy to imagine a less egregious example sitting in our code-base for
years undetected.  (Indeed, I think we probably have a few of those
already...)

On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way.  Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch.  Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.
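
To make that contract concrete, here is a sketch of the pattern I'd consider
safe (the shared struct and SomeSharedLock are hypothetical, and the
two-argument WaitLatch() is the 9.1-era form):

    /* waker: publish the flag under the lock, then set the latch */
    LWLockAcquire(SomeSharedLock, LW_EXCLUSIVE);
    shared->work_available = true;
    LWLockRelease(SomeSharedLock);      /* acts as the memory barrier */
    SetLatch(&shared->latch);

    /* waiter: reset, re-check the flag under the lock, only then sleep */
    for (;;)
    {
        bool        work;

        ResetLatch(&shared->latch);

        LWLockAcquire(SomeSharedLock, LW_SHARED);
        work = shared->work_available;
        LWLockRelease(SomeSharedLock);
        if (work)
            break;

        WaitLatch(&shared->latch, -1L); /* block until set */
    }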

 I'm not sure whether we need to do anything about this for 9.1; it may be
 that none of the existing uses of latches are subject to the kind of
 contention that would create a risk.  But I'm entirely convinced that we
 need to insert some memory-barrier instructions into the latch code before
 we expand our uses of latches much more.  Since we were already finding
 that memory barriers will be needed for performance reasons elsewhere,
 I think it's time to bite the bullet and develop that infrastructure.

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases.  The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


barrier-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] vacuumlo patch

2011-08-08 Thread Robert Haas
On Sun, Aug 7, 2011 at 7:53 PM, Tim elatl...@gmail.com wrote:
 Thanks Josh,
 I like the ability to bail out on PQTRANS_INERROR, and I think it's a small
 enough fix to be appropriate to include in this patch.
 I did consider it before but did not implement it because I am still new to
 pgsql-hackers and did not know how, off-the-cuff.
 So thanks for the big improvement.

I've committed this patch with some changes, mostly cosmetic.  One
not-quite-so-cosmetic change is that I removed the suggestion that -l
should be used with a limit one lower than whatever provoked the
previous failure.  That might be true on a completely idle system, but
is an oversimplification in real life.  Maybe some kind of hint is
appropriate here, but I think if we're going to have one it ought to
be more generically worded.  I also updated the failure error message
to avoid saying that we "failed to remove NNN objects".  Instead, it
now says how many objects it wanted to remove and how far it had
gotten when it failed, which I think is more useful.

Please feel free to propose further patches if you don't like what
I've done here, or want to further build on it or fine-tune...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 1:43 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 The approach is to move the important things from a LWLock to a
 spinlock, and to not do any locking for increments of the clock hand
 and numBufferAllocs.
 That means that some buffers might occasionally get inspected twice
 and some might not get inspected at all during any given clock cycle,
 but this should not lead to any correctness problems.   (Disclosure:
 Tom didn't like this approach when it was last discussed.)

 I just offer this for whatever it is worth to you--I'm not proposing
 it as an actual patch to be applied.

Interesting approach.
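
If I'm reading the description right, the heart of it would be something
like this (a sketch of the described approach, not the actual patch):

    /*
     * Clock hand advanced with a plain unlocked increment; a lost or
     * duplicated increment just means a buffer gets inspected twice or
     * skipped once in a cycle, which is harmless for correctness.
     */
    static volatile int nextVictimBuffer;

    static BufferDesc *
    clock_sweep_candidate(void)
    {
        for (;;)
        {
            unsigned    hand = (unsigned) nextVictimBuffer++;
            BufferDesc *buf = &BufferDescriptors[hand % NBuffers];

            LockBufHdr(buf);            /* per-buffer spinlock only */
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;             /* caller unlocks the header */
            if (buf->usage_count > 0)
                buf->usage_count--;     /* decrement and keep sweeping */
            UnlockBufHdr(buf);
        }
    }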

 When data fits in RAM but not shared_buffers, maybe the easiest fix is
 to increase shared_buffers.  Which brings up the other question I had
 for you about your work with Nate's celebrated loaner machine.  Have
 you tried to reproduce the performance problems that have been
 reported (but without public disclosure of how to reproduce) with
 shared_buffers > 8GB on machines with RAM > 8GB?

No.  That's on my list, but thus far has not made it to the top of
said list.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Robert Haas robertmh...@gmail.com writes:
 It would be nice if the Linux guys would fix this problem for us, but
 I'm not sure whether they will.  For those who may be curious, the
 problem is in generic_file_llseek() in fs/read-write.c.  On a platform
 with 8-byte atomic reads, it seems like it ought to be very possible
 to read inode->i_size without taking a spinlock.  A little Googling
 around suggests that some patches along these lines have been proposed
 and - for reasons that I don't fully understand - rejected.  That now
 seems unfortunate.  Barring a kernel-level fix, we could try to
 implement our own cache to work around this problem.  However, any
 such cache would need to be darn cheap to check and update (since we
 can't assume that relation extension is an infrequent event) and must
 somehow avoid the same sort of mutex contention that's killing the
 kernel in this workload.

 What about making the relation extension much less frequent?  It's been
 talked about before here, that instead of extending 8kB at a time we
 could (should) extend by much larger chunks.  I would go as far as
 preallocating the whole next segment (1GB) (in the background) as soon
 as the current is more than half full, or such a policy.

 Then you have the problem that you can't really use lseek() anymore to
 guess'timate a relation size, but Tom said in this thread that the
 planner certainly doesn't need something that accurate.  Maybe the
 reltuples would do?  If not, it could be that some adapting of its
 accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.
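
A sketch of what chunked extension could look like (posix_fallocate is used
here for brevity; mdextend actually writes zero-filled blocks, and a real
patch would need to keep that behavior and its error handling):

    #include <fcntl.h>

    #define EXTEND_CHUNK    (1024 * 1024)   /* 128 8kB blocks per syscall */

    /* extend the file up to the next chunk boundary past current_size */
    static int
    preextend(int fd, off_t current_size)
    {
        off_t       target = ((current_size / EXTEND_CHUNK) + 1) * EXTEND_CHUNK;

        return posix_fallocate(fd, current_size, target - current_size);
    }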

However, it doesn't really do anything to solve this problem.  The
problem here is not that the size of the relation is changing too
frequently - indeed, it's not changing at all in this test case.  The
problem is rather that testing whether or not the size has in fact
changed is costing too much.

The reason why we are testing the size of the relation here rather
than just using reltuples is because the relation might have been
extended since it was last analyzed.  We can't count on analyze to run
often enough to avoid looking at the actual file size.  If the file's
grown, we have to scale the number of tuples up proportional to the
growth in relpages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 3:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 My experiments have shown that the freelist proper is not
 substantially faster than the freelist clocksweep--and that is even
 under the assumption that putting things back into the freelist is
 absolutely free.

 The freelist isn't there to make buffer allocation faster, though;
 it's there for allocation efficiency.  The point is that when some
 buffers have become completely useless (eg, because we dropped the table
 they were for), they'll be recycled in preference to reclaiming buffers
 that contain still-possibly-useful data.  It would certainly be simple
 to get rid of the freelist and only recycle dead buffers when the clock
 sweep reaches them, but I think we'd be paying for that in extra,
 unnecessary I/O.

This is an interesting point, because I think it really gets at the
heart of the trade-off we're facing here: quality of buffer allocation
vs. concurrency.

Assuming no free buffers are present, we'd ideally like to evict the
buffer that won't be used until the latest possible future time.  LRU
is an approximation of this.  Our clock-sweep algorithm is a
less-accurate approximation of this that pays for itself with less
locking.  If we really wanted true LRU, we'd have to move things
around in the list every time a buffer was pinned.  That would be
really expensive.  Our clock sweep algorithm only requires fighting
over a piece of shared state when we want to kick a buffer out, rather
than every time we want to do anything to a buffer.  And it's probably
only slightly worse in terms of buffer allocation efficiency than true
LRU, so on balance it appears to be a good trade-off.

Even so, I think it's pretty trivial to construct a workload where
performance is throttled by BufFreelistLock.  I seriously doubt we're
going to be able to solve that problem by optimizing the code that
runs while holding that lock.  We might buy ourselves a few more
percentage points here or there, but if we really want to bust this
bottleneck we're going to need to move to some algorithm that is an
even-less-perfect approximation of LRU but which doesn't involve a
single cluster-wide lock that throttles all buffer allocation.  No
matter how fast you make the critical section protected by that lock,
given enough CPUs, it will eventually not be fast enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
In response to my blog post on lseek contention, someone posted a
comment wherein they proposed using fstat() rather than lseek() to get
file sizes.

http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html

I tried that on a RHEL 6.1 machine with 64-cores running
2.6.32-131.6.1.el6.x86_64, and it's pretty clear that the locking
characteristics are completely different.  At 1 client, the lseek
method appears to be slightly faster, although it's not beyond belief
that the difference could be in the noise.  Above 40 cores, however,
the fstat method thumps the lseek method up one side and down the
other.
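
For reference, the two ways of asking the kernel for a file's size (in md.c
terms this sits behind FileSeek() in _mdnblocks(); shown standalone here):

    #include <sys/stat.h>
    #include <unistd.h>

    static off_t
    size_by_lseek(int fd)
    {
        /* serializes on the per-inode lock discussed in the blog post */
        return lseek(fd, 0, SEEK_END);
    }

    static off_t
    size_by_fstat(int fd)
    {
        struct stat st;

        return (fstat(fd, &st) == 0) ? st.st_size : (off_t) -1;
    }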

Patch and test results are attached.  Test runs are 5-minute runs with
scale factor 100 and shared_buffers=8GB.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


filesize.patch
Description: Binary data


pgbench -S at 32 cores - fstat patch comparison.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 In response to my blog post on lseek contention, someone posted a
 comment wherein they proposed using fstat() rather than lseek() to get
 file sizes.
 Patch and test results are attached.  Test runs are 5-minute runs with
 scale factor 100 and shared_buffers=8GB.

 Thoughts?

I'm a bit concerned by the fact that you've only tested this on one
operating system, and thus the performance characteristics could be
quite different elsewhere.  The comment in mdextend also points out
a way in which this might not be a win --- did you test anything besides
read-only scenarios?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Andres Freund
On Monday, August 08, 2011 10:30:38 Robert Haas wrote:
 In response to my blog post on lseek contention, someone posted a
 comment wherein they proposed using fstat() rather than lseek() to get
 file sizes.
 
 Thoughts?
I don't think it's a good idea to replace lseek with fstat in the long run. The
likelihood that the lockless generic_file_llseek will get included seems rather
high to me. In contrast to that, fstat will always be more expensive than that,
as it's going through a security check and then the fs' getattr implementation
(which actually takes a lock on some fs).
On the other hand it's currently lockless if the security subsystem is compiled
out (i.e. no selinux et al) for some common fs (ext3/4, xfs).

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP fix proposal for bug #6123

2011-08-08 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Florian Pflug f...@phlo.org wrote:
 Going down that road opens the door to a *lot* of subtle semantic
 differences between currently equivalent queries. For example,

  UPDATE T SET f=f, a=1

 would behave differently then

  UPDATE T SET a=1

 because in the first case the new row would depend on the old
 row's value of f, while in the second case it doesn't.
 
 True.  But I'm not really bothered by that.  If the user gets an
 error message that says:
 
 ERROR: updated column "f" has already been modified by a BEFORE
 trigger
 
 ...the user will realize the error of their ways.
 
 There's no way to get the same result as if you'd done either
 one of them first, because  they are inextricably
 intertwined.

 In practice, my hand-wavy reference to reconciling the updates
 is a problem because of the way the trigger interface works.  It
 feels pretty impossible to decide that we're going to do the
 update, but with some other random values we dredged up from
 some other update replacing some of the ones the user explicitly
 handed to us. But if the trigger could return an indication of
 which column values it wished to override, then it seems to me
 that we could piece together a reasonable set of semantics. 
 It's not exactly clear how to make that work, though.

 I dunno, that all still feels awfully complex. As you said
 yourself, this case is quite similar to a serialization anomaly.
 Taking that correspondence further, that reconciliation of
 updates is pretty much what the EPQ machinery does in READ
 COMMITTED mode. Now, we ourselves have warned users in the past
 to *not* use READ COMMITTED mode if they do complex updates
 (e.g., use UPDATE ... FROM ...), because the behaviour of that
 reconciliation machinery in the present of concurrent updates is
 extremely hard to predict. I thus don't believe that it's a good
 idea to introduce similarly complex behaviour in other parts of
 the system - and particularly not if you cannot disable it by
 switching to another isolation level.

 Simply throwing an error, on the other hand, makes the behaviour
 simple to understand and explain.
 
 True.  I'm coming around to liking that behavior better than I did
 on first hearing, but I'm still not crazy about it, because as an
 app developer I would really like to have at least the
 unproblematic cases actually work.  Throwing an error at least
 makes it clear that you've done something which isn't supported,
 and that's probably an improvement over the current, somewhat-
 baffling behavior. However, it's not even 25% as nice as having it
 actually work as intended.  That's why, even if we can't make all
 the cases work sanely, I'd be a lot more enthusiastic about it if
 we could find a way to make at least some of them work sanely. 
 The mind-bending cases are unlikely to be the ones people do on
 purpose.
 
So, we have three options on the table here:
 
(1) We (the Wisconsin Courts) are using a very simple fix to work
around the issue so we can move forward with conversion to
PostgreSQL triggers.  A DELETE is allowed to complete if the BEFORE
trigger doesn't return NULL, even if the row was updated while the
trigger was executing.  An UPDATE fails if the BEFORE trigger
doesn't return NULL and any other update is made to the row while
the trigger was executing.
 
(2) You (Robert) have proposed (as I understand it) modifying that
approach to allow some UPDATE cases to work, where there are not
conflicting updates to any one column within the row.
 
(3) Florian has proposed banning all updates to a row which is being
processed by a BEFORE UPDATE or BEFORE DELETE trigger.  As I
understand it, this would be similar to the approach taken by
Oracle, although less strict.
 
I can live with any of these solutions.  How do we move forward to
consensus so that a patch can be written?  Does anyone else have a
preference for one of these approaches versus another?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Jesper Krogh

On 2011-08-08 15:29, Robert Haas wrote:

On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

Robert Haas robertmh...@gmail.com writes:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read-write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.  A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected.  That now
seems unfortunate.  Barring a kernel-level fix, we could try to
implement our own cache to work around this problem.  However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow having the same sort of mutex contention that's killing the
kernel in this workload.

What about making the relation extension much less frequent?  It's been
talked about before here, that instead of extending 8kB at a time we
could (should) extend by much larger chunks.  I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current is more than half full, or such a policy.

Then you have the problem that you can't really use lseek() anymore to
guess'timate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate.  Maybe the
reltuples would do?  If not, it could be that some adapting of its
accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.

Pre-extending in bigger chunks has other benefits
as well, since it helps the filesystem (if it supports extents) to get
the data from the relation laid out in sequential order on disk.

On a well-filled relation, doing filefrag on an ext4 filesystem reveals
that data loaded during initial creation gives 10-11 extents per 1GB
file, whereas a relation filled over time gives as many as 128 extents.

I would suggest 5% of the current relation size or 25-100MB, whichever is
smaller. That would still keep the size down on small relations.

--
Jesper


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Peter Geoghegan
On 8 August 2011 13:47, Robert Haas robertmh...@gmail.com wrote:
 On the flip side, I'm not sure that anyone ever really expected that a
 latch could be safely used this way.  Normally, I'd expect the flag to
 be some piece of state protected by an LWLock, and I think that ought
 to be OK provided that the lwlock is released before setting or
 checking the latch.

I'm inclined to agree. FWIW I'll point out that in addition to your
point, it's also worth remembering that a shared latch isn't always
necessary, as in the case of the archiver, so this shouldn't
fundamentally shake our confidence in the latch implementation.

 Maybe we should try to document the contract here
 a little better; I think it's just that there must be a full memory
 barrier (such as LWLockRelease) in both processes involved in the
 interaction.

Or, maybe we should think about pushing that into the latch
implementation, while providing an interface that is easy to use
correctly and hard to use incorrectly. The latch code already has a
pretty complicated contract, and I had to bend over backwards using it
in the AVLauncher patch that I submitted to the upcoming commitfest.
Weakening or equivocating the contract any further should be
considered a last resort.

 I've been thinking about this too and actually went so far as to do
 some research and put together something that I hope covers most of
 the interesting cases.  The attached patch is pretty much entirely
 untested, but reflects my present belief about how things ought to
 work.

That's interesting. Nice work.

It seems likely that Windows will support ARM in some way in the next
couple of years, so it's good that you're handling this using a win32
abstraction where available. Of course, Itanic is currently supported
on Windows, though not by us, and it is obviously never going to be
worth pursuing such support. The point is that it generally isn't safe
to assume that we'll only ever support Windows on x86 and x86_64.

I'd like to see a #warning where you fall back on acquiring and
releasing a spinlock, or at least a #warning if that in turn falls
back on the semaphore-based spinlock implementation. Granted, that
directive isn't in the C standard, but it's pretty portable in
practice. If it breaks the build on some very esoteric platform, that
may not be a bad thing - in fact, maybe you should use the portable
#error directive to make sure it breaks. I'd rather have someone
complain about that than lots more people complain about Postgres
being inexplicably slow, or worse not complain and dismiss Postgres
out of hand for their use-case.

By the way, I don't think that the comment "Won't work on Visual
Studio 2003" is accurate. Everything you do for the
defined(WIN32_ONLY_COMPILER) case is documented as working with 2003.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 10:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm a bit concerned by the fact that you've only tested this on one
 operating system, and thus the performance characteristics could be
 quite different elsewhere.  The comment in mdextend also points out
 a way in which this might not be a win --- did you test anything besides
 read-only scenarios?

Nope.

On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund and...@anarazel.de wrote:
 I don't think it's a good idea to replace lseek with fstat in the long run. The
 likelihood that the lockless generic_file_llseek will get included seems rather
 high to me. In contrast to that, fstat will always be more expensive, as it's
 going through a security check and then the fs' getattr implementation (which
 actually takes a lock on some fs).

*scratches head*  I understand that stat() would need a security
check, but why would fstat()?

I think both of you raise good points.  I wasn't too enthusiastic
about this approach either.  It's not very appealing to adopt an
approach where the right performance decision is going to depend on
operating system, file system, kernel version, core count, and
workload.  We could add a GUC, but it would be pretty annoying to have
a setting that won't matter for most people at all, except
occasionally when it makes a huge difference.
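
For reference, the two calls being weighed both yield the current file length;
the difference is purely in their kernel paths (a sketch -- as I understand it,
md.c today obtains relation sizes via the lseek form, through
FileSeek(..., SEEK_END)):

#include <sys/stat.h>
#include <unistd.h>

/* lseek form: note it also moves the file offset to EOF as a side effect */
static off_t
file_size_lseek(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/* fstat form: leaves the offset alone, but fills in a whole struct stat */
static off_t
file_size_fstat(int fd)
{
    struct stat st;

    return (fstat(fd, &st) < 0) ? -1 : st.st_size;
}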

I wasn't aware that there was any current activity around this on the Linux
side.  But Andres' comments made me Google it again, and now I see
this:

https://lkml.org/lkml/2011/6/16/800

Andres, any idea what the status of that patch is?  I'm not clear on
how Linux works in terms of things getting upstreamed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP fix proposal for bug #6123

2011-08-08 Thread Florian Pflug
On Aug8, 2011, at 17:02 , Kevin Grittner wrote:
 So, we have three options on the table here:
 
 (1) We (the Wisconsin Courts) are using a very simple fix to work
 around the issue so we can move forward with conversion to
 PostgreSQL triggers.  A DELETE is allowed to complete if the BEFORE
 trigger doesn't return NULL, even if the row was updated while the
 trigger was executing.  An UPDATE fails if the BEFORE trigger
 doesn't return NULL and any other update is made to the row while
 the trigger was executing.
 
 (2) You (Robert) have proposed (as I understand it) modifying that
 approach to allow some UPDATE cases to work, where there are not
 conflicting updates to any one column within the row.
 
 (3) Florian has proposed banning all updates to a row which is being
 processed by a BEFORE UPDATE or BEFORE DELETE trigger.  As I
 understand it, this would be similar to the approach taken by
 Oracle, although less strict.
Yeah, though much, much less strict.

I think it would be helpful if we had a more precise idea about the
intended use-cases. So far, the only use-case that has been described in
detail is the one which made Kevin aware of the problem. But if I
understood Kevin correctly, the fact that they use BEFORE and not AFTER
triggers is more of an accident than the result of a conscious design
decision. Though he did also mention that there might actually be advantages
to using BEFORE instead of AFTER triggers, because that prevents other
triggers from seeing a non-consistent state.

What I can add is that AFAIR all instances of same-row UPDATEs from BEFORE
triggers I ever encountered were replaceable by a simple assignment to NEW.
(That, however, is more of an anti-usecase.)
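
For concreteness, the replaceable pattern is the one below (a sketch with an
invented column name):

-- in a BEFORE UPDATE trigger body, instead of
--   UPDATE t SET modified_at = now() WHERE id = NEW.id;
-- it suffices to write
NEW.modified_at := now();
RETURN NEW;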

best regards,
Florian Pflug




Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of lun ago 08 06:27:33 -0400 2011:
 String-formatted relopts were never used before, but I've used them in the
 buffering GiST index build patch and encountered the following compiler
 warnings:
 
 reloptions.c:259: warning: initializer-string for array of chars is too long
 reloptions.c:259: warning: (near initialization for 'stringRelOpts[0].default_val')
 
 It is caused by the default_val field of the relopt_string structure being
 defined as a 1-length character array.  This seems to be a design flaw in the
 reloptions.c code.  Any thoughts?

Maybe this needs to use the new FLEXIBLE_ARRAY_MEMBER stuff.  Can you try that 
please?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alexander Korotkov
On Mon, Aug 8, 2011 at 7:43 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Maybe this needs to use the new FLEXIBLE_ARRAY_MEMBER stuff.  Can you try
 that please?


typedef struct relopt_string
{
    relopt_gen  gen;
    int         default_len;
    bool        default_isnull;
    validate_string_relopt validate_cb;
    char        default_val[1]; /* variable length, zero-terminated */
} relopt_string;

static relopt_string stringRelOpts[] =
...

I doubt a variable-length data structure is possible in this case, because we
don't have an array of pointers to relopt_string, just an array of
relopt_string directly.  Maybe just
char *default_val;
is possible?
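
For what it's worth, the pointer variant would at least make the static
initializer legal (a sketch of the struct change only; the single-palloc
allocation path used by add_string_reloption would presumably need matching
adjustments):

typedef struct relopt_string
{
    relopt_gen  gen;
    int         default_len;
    bool        default_isnull;
    validate_string_relopt validate_cb;
    char       *default_val;    /* a string literal of any length now fits */
} relopt_string;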

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 11:28 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Maybe we should try to document the contract here
 a little better; I think it's just that there must be a full memory
 barrier (such as LWLockRelease) in both processes involved in the
 interaction.

 Or, maybe we should think about pushing that into the latch
 implementation, while providing an interface that is easy to use
 correctly and hard to use incorrectly.

That would probably result in unnecessary double-synchronization in
many cases, though.  Memory barriers are expensive, and I think we'd
better adopt a policy of trying to figure out where they are truly
needed and using them only in those places.  Or to put that another
way, having spent the last couple of months trying to reduce our
synchronization overhead, I'm not real keen on adding more unless we
really need it.

 I've been thinking about this too and actually went so far as to do
 some research and put together something that I hope covers most of
 the interesting cases.  The attached patch is pretty much entirely
 untested, but reflects my present belief about how things ought to
 work.

 That's interesting. Nice work.

Thanks!

 It seems likely that Windows will support ARM in some way in the next
 couple of years, so it's good that you're handling this using a win32
 abstraction where available. Of course, Itanic is currently supported
 on Windows, though not by us, and it is obviously never going to be
 worth pursuing such support. The point is that it generally isn't safe
 to assume that we'll only ever support Windows on x86 and x86_64.

Agreed.  I saw an MSDN article that referred to architectures on
which Windows is supported.  It didn't say which those were, but I
figured I'd better not assume they were all platforms with strong
memory ordering.

 I'd like to see a #warning where you fall back on acquiring and
 releasing a spinlock, or at least a #warning if that in turn falls
 back on the semaphore-based spinlock implementation. Granted, that
 directive isn't in the C standard, but it's pretty portable in
 practice. If it breaks the build on some very esoteric platform, that
 may not be a bad thing - in fact, maybe you should use the portable
 #error directive to make sure it breaks. I'd rather have someone
 complain about that than lots more people complain about Postgres
 being inexplicably slow, or worse not complain and dismiss Postgres
 out of hand for their use-case.

I was thinking about whether and how we could expose that information.
 I was almost tempted to add a read-only GUC that would output the
details of what kind of synchronization we're using.  So the bad case
would look something like this:

spinlock=semaphore,memory_barrier=spinlock,read_barrier=spinlock,write_barrier=spinlock,compiler_barrier=spinlock

And the good case would look something like this:

spinlock=gcc-inline-asm,memory_barrier=gcc-inline-asm,read_barrier=gcc-inline-asm,write_barrier=gcc-inline-asm,compiler_barrier=gcc-inline-asm

And you can imagine other values, like compiler-intrinsic.  Of course,
jamming a whole CSV file into a GUC value is a bit unappealing.  Maybe
it would be better to have a system view with two columns,
synchronization_primitive and implementation.
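
The output of such a view might then look something like this (entirely
hypothetical -- the view name and rows here are invented for illustration):

SELECT * FROM pg_synchronization_implementation;

 synchronization_primitive | implementation
---------------------------+-----------------
 spinlock                  | gcc-inline-asm
 memory_barrier            | gcc-inline-asm
 read_barrier              | gcc-inline-asm
 write_barrier             | gcc-inline-asm
 compiler_barrier          | gcc-inline-asm
(5 rows)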

Either way, I'd prefer this to a #warning, because that's only going
to be visible at compile-time, which means that when $PGCOMPANY gets
called in to figure out why the system is dog slow, it's going to take
a bit of forensics to figure out where the build came from and what
options are in use.  Is that a huge problem?  Maybe not, but it seems
like it would be a nice thing to make it easy for people to log into a
running system and immediately be able to determine which forms of
wizardry we're using there.

 By the way, I don't think that the comment Won't work on Visual
 Studio 2003 is accurate. Everything you do for the
 defined(WIN32_ONLY_COMPILER) case is documented as working with 2003.

Oh, really?  I thought I ran across something that said otherwise, but
it might have been wrong, or maybe I got confused somewhere along the
line.  I haven't tested anything more than that this compiles without
erroring out on MacOS X, so the probability of errors and omissions is
not small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Kohei KaiGai's message of lun ago 08 03:12:20 -0400 2011:

 Thanks for your suggestion.
 So, it seems to me the interface should return a pointer to the entry
 of array being specified, rather than above approach.
 
 E.g, the above macro could be probably rewritten as follows:
   #define get_object_property_attnum_name(objtype) \
   (get_object_property(objtype)->attnum_name)

I don't understand why don't you just do something like

   #define get_object_property_attnum_name(objtype, attnum_name_value) \
   (get_object_property((objtype), NULL, NULL, (attnum_name_value), NULL, NULL))

and the caller does

AttrNumber  attnum_name;
get_object_property_attnum_name(OBJTYPE_TABLE, attnum_name);

i.e. the caller must still pass pointers, instead of expecting the
values to be returned.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 11:57 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Kohei KaiGai's message of lun ago 08 03:12:20 -0400 2011:

 Thanks for your suggestion.
 So, it seems to me the interface should return a pointer to the entry
 of array being specified, rather than above approach.

 E.g, the above macro could be probably rewritten as follows:
   #define get_object_property_attnum_name(objtype) \
        (get_object_property(objtype)->attnum_name)

 I don't understand why don't you just do something like

   #define get_object_property_attnum_name(objtype, attnum_name_value) \
        (get_object_property((objtype), NULL, NULL, (attnum_name_value), NULL, NULL))

 and the caller does

 AttrNumber      attnum_name;
 get_object_property_attnum_name(OBJTYPE_TABLE, attnum_name);

 i.e. the caller must still pass pointers, instead of expecting the
 values to be returned.

We could do that, but what the heck is the point?   What benefit are
we trying to get by not returning a pointer to the structure?  I feel
like we're making this ludicrously complicated with no real
justification of why all of this complexity is adding any value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] per-column FDW options, v5

2011-08-08 Thread Robert Haas
2011/8/8 Shigeru Hanada shigeru.han...@gmail.com:
 Currently table-level options are shown in the result of the \det+ command
 (only verbose mode), in the same style as FDWs and foreign servers.

 But \d is more popular for table describing, so moving table-level
 options from \det+ to \d might be better.  Thoughts?

 (2011/08/06 9:26), Robert Haas wrote:
 I'd show it both places.

 After taking a look at describe.c, I think some styles are applicable to
 FDW options in result of \d command.

 (1) simple array literal style
 Show table-level FDW options like other FDW options.  It is simply the
 result of array_out(); each option is shown as key=value with quoting
 if necessary and delimited by ','.  The whole line is surrounded by { and }.
 If an element includes any character which needs to be escaped, such an
 element is quoted with double quotes.

    Ex)
    FDW Options: {delimiter=,,quote=\}
    #delimiter is a comma, and quote is a double-quote

 (2) reloptions style
 Show FDW options like reloptions in \d+ output.  Each option is shown
 as key=value without quoting.  Some special characters might make it a
 little illegible.

    Ex)
    FDW Options: delimiter=,, quote=
    #delimiter is a comma, and quote is a double-quote

 (3) OPTIONS clause style
 Show FDW options as they appear in the OPTIONS clause.  Each option is
 shown as key 'value', and delimited with ','.

    Ex)
    FDW Options: delimiter ',', quote 
    #delimiter is a comma, and quote is a single-quote

 (1) can be implemented with minimal change, and it also keeps the
 behavior consistent with other existing FDW objects.  But IMHO (3) is
 the most readable, though it breaks backward compatibility regarding the
 format of FDW options used in the result of \d* commands.  Thoughts?

I'm against #2, but I could go either way on #1 vs. #3.  If you pick
#3, would you also change the column options to be displayed that way,
or would we end up with table and column options displayed
differently?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] psql document fix about showing FDW options

2011-08-08 Thread Robert Haas
2011/8/8 Shigeru Hanada shigeru.han...@gmail.com:
 I noticed that psql document wrongly says that \d+ command shows
 per-table FDW options of a foreign table, but in fact, per-table FDW
 options are shown only in the result of \det+ command.  Attached patch
 removes this wrong description.

 This fix should be applied to 9.1 too.

I think we ought to fix the behavior, rather than the documentation.
It's just weird to have something show up in the list view that
doesn't appear in the detail view.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:

 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?  I feel
 like we're making this ludicrously complicated with no real
 justification of why all of this complexity is adding any value.

I don't see that it's complicated in any way.  Seems pretty simple to me.
The justification is simple too: don't export struct definitions without
need.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of lun ago 08 11:50:53 -0400 2011:
 On Mon, Aug 8, 2011 at 7:43 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 
  Maybe this needs to use the new FLEXIBLE_ARRAY_MEMBER stuff.  Can you try
  that please?
 
 
 typedef struct relopt_string
 {
     relopt_gen  gen;
     int         default_len;
     bool        default_isnull;
     validate_string_relopt validate_cb;
     char        default_val[1]; /* variable length, zero-terminated */
 } relopt_string;
 
 static relopt_string stringRelOpts[] =
 ...
 
 I doubt a variable-length data structure is possible in this case, because we
 don't have an array of pointers to relopt_string, just an array of
 relopt_string directly.  Maybe just
 char *default_val;
 is possible?

An array of relopt_string?  Isn't that a bit strange?  If I recall
correctly, the point of this was to be able to allocate the
relopt_string struct and the char array itself as a single palloc unit,
in a single call somewhere in the reloptions API (which was convoluted
in some points precisely to let the string case work).  I don't have the
details of this fresh in my mind though.  It certainly worked with more
than one string option when I committed it, IIRC.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 12:22 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:
 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?  I feel
 like we're making this ludicrously complicated with no real
 justification of why all of this complexity is adding any value.

 I don't see that it's complicated in any way.  Seems pretty simple to me.

It's quite complicated.  Every time someone adds a new member to the
struct, they need to adjust the code all over the place.  The call
sequence of the accessor function will constantly be changing.  We
generally try to structure our code base so that patches don't need to
touch more call sites than necessary, and we regularly get very
nervous about patches that touch too many call sites and ask them to
be changed so that they don't.  Here, you're proposing a design that
is practically guaranteed to have that problem every time someone
makes a change in that area.

 The justification is simple too: don't export struct definitions without
 need.

That's a reasonable guiding principle, but I believe that the
underlying reason to do that is information hiding.  In other words,
if the details of what's in the struct don't matter to other modules,
then they shouldn't peek into it.  That way, when you want to change
something, you don't have to worry about whether anyone else cares
about the encapsulated data, because you already know they don't.  In
this case, however, the struct - by design - isn't going to contain
any private data.  If callers are going to potentially need access to
every member of the struct, then you may as well export it and be done
with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Kohei KaiGai's message of lun ago 08 12:18:47 -0400 2011:
 2011/8/8 Robert Haas robertmh...@gmail.com:

  We could do that, but what the heck is the point?   What benefit are
  we trying to get by not returning a pointer to the structure?  I feel
  like we're making this ludicrously complicated with no real
  justification of why all of this complexity is adding any value.
 
 I agree with Robert's opinion. It seems to me we have little benefit in
 keeping the structure confidential to other components.

So you coded my suggestion in an extremely awkward way just to be able
to dismiss it?

We use that pattern in a lot of places.  See get_op_opfamily_properties
for the first example I found (took me 15 secs to find it).  I don't
understand why you think it's so complicated or horrible.

Please tell me, why don't we just return Form_pg_amop in that function?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Any protocol of that sort has *obviously* got a race condition between
 changes of the latch state and changes of the separate flag state, if run
 in a weak-memory-ordering machine.

 On the flip side, I'm not sure that anyone ever really expected that a
 latch could be safely used this way.  Normally, I'd expect the flag to
 be some piece of state protected by an LWLock, and I think that ought
 to be OK provided that the lwlock is released before setting or
 checking the latch.  Maybe we should try to document the contract here
 a little better; I think it's just that there must be a full memory
 barrier (such as LWLockRelease) in both processes involved in the
 interaction.

Heikki argued the same thing in
http://archives.postgresql.org/message-id/4cea5a0f.1030...@enterprisedb.com
but I don't think anyone has actually done the legwork to verify that
current uses are safe.  So [ ... warms up grep ... ]

Current callers of WaitLatch(OrSocket):

XLogPageRead waits on XLogCtl-recoveryWakeupLatch

latch is set by WakeupRecovery, which is called by process'
own signal handlers (OK) and XLogWalRcvFlush.  The latter is OK
because there's a spinlock protecting the transmitted data.

pgarch_MainLoop waits on mainloop_latch

non-issue because latch is process-local

WalSndLoop waits on MyWalSnd-latch

latch is set by signal handlers and WalSndWakeup.  The latter is
OK because it's called right after XLogFlush (which certainly
locks whatever it touches) or a spinlock release.

SyncRepWaitForLSN waits on MyProc-waitLatch

latch is set by signal handlers and SyncRepWakeQueue.  The
latter appears unsafe at first glance, because it sets the
shared variable (thisproc-syncRepState) and immediately
sets the latch.  However, the receiver does this curious dance:

syncRepState = MyProc->syncRepState;
if (syncRepState == SYNC_REP_WAITING)
{
    LWLockAcquire(SyncRepLock, LW_SHARED);
    syncRepState = MyProc->syncRepState;
    LWLockRelease(SyncRepLock);
}

which I believe makes it safe, though rather underdocumented;
if a race does occur, the receiver will acquire the lock and
recheck, and the lock acquisition will block until the caller of
SyncRepWakeQueue has released SyncRepLock, and that release
will flush any pending writes to shared memory.

Conclusions:

(1) 9.1 does not have a problem of this sort.

(2) Breathing hard on any of this code could break it.

(3) I'm not going to hold still for not inserting a memory barrier
into SetLatch, once we have the code.  Unsubstantiated performance
worries are not a sufficient argument not to --- as a wise man once
said, you can make the code arbitrarily fast if it doesn't have to
give the right answer.

(4) In the meantime, we'd better document that it's caller's
responsibility to ensure that the flag variable is adequately
protected by locking.  I think SyncRepWaitForLSN could use a warning
comment, too.  I will go do that.
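
The sort of warning comment in question might read roughly as follows (wording
invented here, not the actual commit text):

/*
 * Note: when a latch is used to wait for a condition recorded in a shared
 * flag variable, the caller must protect that variable with a lock (or
 * another operation implying a full memory barrier) on both the setting
 * and the waiting side; SetLatch/ResetLatch by themselves do not currently
 * guarantee memory ordering.
 */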

regards, tom lane



Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Heikki Linnakangas

On 08.08.2011 19:39, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way.  Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch.  Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.


Heikki argued the same thing in
http://archives.postgresql.org/message-id/4cea5a0f.1030...@enterprisedb.com
but I don't think anyone has actually done the legwork to verify that
current uses are safe.  So [ ... warms up grep ... ]


Thanks for double-checking.


(3) I'm not going to hold still for not inserting a memory barrier
into SetLatch, once we have the code.  Unsubstantiated performance
worries are not a sufficient argument not to --- as a wise man once
said, you can make the code arbitrarily fast if it doesn't have to
give the right answer.


I agree we definitely should add the memory barrier instructions there
once we have them.



(4) In the meantime, we'd better document that it's caller's
responsibility to ensure that the flag variable is adequately
protected by locking.  I think SyncRepWaitForLSN could use a warning
comment, too.  I will go do that.


Ok, thanks.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Kohei KaiGai
2011/8/8 Robert Haas robertmh...@gmail.com:
 On Mon, Aug 8, 2011 at 11:57 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Kohei KaiGai's message of lun ago 08 03:12:20 -0400 2011:

 Thanks for your suggestion.
 So, it seems to me the interface should return a pointer to the entry
 of array being specified, rather than above approach.

 E.g, the above macro could be probably rewritten as follows:
   #define get_object_property_attnum_name(objtype) \
        (get_object_property(objtype)->attnum_name)

 I don't understand why don't you just do something like

   #define get_object_property_attnum_name(objtype, attnum_name_value) \
        (get_object_property((objtype), NULL, NULL, (attnum_name_value), NULL, NULL))

 and the caller does

 AttrNumber      attnum_name;
 get_object_property_attnum_name(OBJTYPE_TABLE, attnum_name);

 i.e. the caller must still pass pointers, instead of expecting the
 values to be returned.

 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?  I feel
 like we're making this ludicrously complicated with no real
 justification of why all of this complexity is adding any value.

I agree with Robert's opinion. It seems to me we have little benefit in
keeping the structure confidential to other components.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Robert Haas's message of lun ago 08 12:33:45 -0400 2011:
 On Mon, Aug 8, 2011 at 12:22 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:
  We could do that, but what the heck is the point?   What benefit are
  we trying to get by not returning a pointer to the structure?  I feel
  like we're making this ludicrously complicated with no real
  justification of why all of this complexity is adding any value.
 
  I don't see that it's complicated in any way.  Seems pretty simple to me.
 
 It's quite complicated.  Every time someone adds a new member to the
 struct, they need to adjust the code all over the place.  The call
 sequence of the accessor function will constantly be changing.

You are rehashing an argument that I already rebutted: the way to solve
this problem is with the macros, the format of which we were discussing
in this sub-thread.  I am not sure why you bring it up again.

You say the reason for not exporting the struct definition is that you
can later change it without having to touch the callers.  That seems a
good argument, and it seems to me to work equally for pg_amop as for the
new header KaiGai is creating here.

In any case, at the start of the thread KaiGai was asking for opinions,
and I gave mine.  I will now shut up.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Andres Freund
On Monday, August 08, 2011 11:33:29 Robert Haas wrote:

 On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund and...@anarazel.de wrote:
  I don't think it's a good idea to replace lseek with fstat in the long
  run. The likelihood that the lockless generic_file_llseek will get
  included seems rather high to me. In contrast to that, fstat will always
  be more expensive, as it's going through a security check and
  then the fs' getattr implementation (which actually takes a lock on
  some fs).
 *scratches head*  I understand that stat() would need a security
 check, but why would fstat()?
That I am not totally sure of either; I guess KaiGai might know more about
that.
It might be that a forked process is no longer allowed to access the
information from an inherited file handle? Also, I think a process can change
its permissions during runtime.

 I think both of you raise good points.  I wasn't too enthusiastic
 about this approach either.  It's not very appealing to adopt an
 approach where the right performance decision is going to depend on
 operating system, file system, kernel version, core count, and
 workload.  We could add a GUC, but it would be pretty annoying to have
 a setting that won't matter for most people at all, except
 occasionally when it makes a huge difference.
 
 I wasn't aware that there was any current activity around this on the Linux
 side.  But Andres' comments made me Google it again, and now I see
 this:
 
 https://lkml.org/lkml/2011/6/16/800
 
 Andres, any idea what the status of that patch is?  I'm not clear on
 how Linux works in terms of things getting upstreamed.
There doesn't seem to have been any activity to include it in 3.1. The merge
window for 3.1 just ended. The next one will open for about a week after the
release.
It's also not yet included in linux-next, which is a preview for the currently
worked-on release + 1. A release takes roughly 3 months.

For upstreaming, somebody needs to be persistent enough to convince one of the
maintainers of the particular area to include the code so that Linus can then
pull it.
I guess citing your numbers would go a long way in that direction. Naturally
it would be even better to include results with the patch applied.
The largest machine I can reboot often enough to test such a thing has only two
sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
new kernel easily?

Greetings, 
Andres



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?

Not having an ABI break if we find it necessary to add members to the
struct ... which I grant is unlikely to happen in a minor version
update, but it could happen.

Having said that, the proposed coding with a single function returning N
pointer parameters seems like it's been written in the ugliest, least
efficient possible way, and it fails to resolve the ABI-break objection
anyway.  (If you did need another struct member, you'd also need another
return parameter, thus forcing recompiles.)  The form I had in mind was
N actual functions (not macros) that internally look like

sometype
get_object_property_foo(...)
{
    structptr  *p = get_object_property_struct(...);

    return p->foo;
}

The only objection I can see to this is that somebody who needs several
different attributes of the same object would have to pay the lookup
cost several times.  That may or may not be an adequate reason to allow
people to fetch the struct directly; without seeing a list of places
where this would happen, it's impossible to evaluate the performance
costs.

regards, tom lane



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund and...@anarazel.de wrote:
 There doesn't seem to have been any activity to include it in 3.1. The merge
 window for 3.1 just ended. The next one will open for about a week after the
 release.
 It's also not yet included in linux-next, which is a preview for the currently
 worked-on release + 1. A release takes roughly 3 months.

OK.  If it doesn't get into Linux 3.2 we had better start thinking
hard about a workaround on our side.  I am not too concerned about
people hitting this with PostgreSQL 9.1 or prior, because you'd
basically need a workload targeted to exercise the problem, which
workload is not that similar to the way people actually do things in
real life.  However, in PostgreSQL 9.2devel, it's going to be much
more of a real-world problem, so I'd hate to wait until after our
feature freeze and then decide we've got a problem we have to fix.

 For upstreaming somebody needs to be persistent enough to convince one of the
 maintainers of the particular area to include the code so that linus then can
 pull that.
 I guess citing your numbers would go a long way in that direction. Naturally
 it would be even better to include results with the patch applied.
 The largest machine I can reboot often enough to test such a thing has only two
 sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
 new kernel easily?

Not really.  I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alexander Korotkov
On Mon, Aug 8, 2011 at 8:27 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 An array of relopt_string?  Isn't that a bit strange?  If I recall
 correctly, the point of this was to be able to allocate the
 relopt_string struct and the char array itself as a single palloc unit,
 in a single call somewhere in the reloptions API (which was convoluted
 in some points precisely to let the string case work).  I don't have the
 details of this fresh in my mind though.  It certainly worked with more
 than one string option when I committed it, IIRC.

Yes, it seems strange. But it also seems that both were added by your commit
of the table-based parser:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ba748f7a11ef884277b61d1708a17a44acfd1736

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 1:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?

 Not having an ABI break if we find it necessary to add members to the
 struct ... which I grant is unlikely to happen in a minor version
 update, but it could happen.

I think that would only be a problem if we added members to the middle
of the struct, rather than at the end.

However...

 Having said that, the proposed coding with a single function returning N
 pointer parameters seems like it's been written in the ugliest, least
 efficient possible way, and it fails to resolve the ABI-break objection
 anyway.  (If you did need another struct member, you'd also need another
 return parameter, thus forcing recompiles.)  The form I had in mind was
 N actual functions (not macros) that internally look like

        sometype
        get_object_property_foo(...)
        {
                structptr *p = get_object_property_struct(...);

                return p->foo;
        }

 The only objection I can see to this is that somebody who needs several
 different attributes of the same object would have to pay the lookup
 cost several times.  That may or may not be an adequate reason to allow
 people to fetch the struct directly; without seeing a list of places
 where this would happen, it's impossible to evaluate the performance
 costs.

...doing it this way seems fine, because IIRC this is all only being
used to support DDL operations, and the overhead of a few extra
function calls isn't going to matter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Not really.  I do have root access to a 64-core box at the moment, and
 I could probably get permission to reboot it, but if it didn't come
 back on-line that would be awkward.

Red Hat has some test hardware that I can use (... pokes around ...)
Hmm, this one looks promising:

Memory    NUMA Nodes
64348 MB  4

  Cpu
Vendor        Model Name                              Family  Model  Stepping  Speed   Processors  Cores  Sockets  Hyper
GenuineIntel  Intel(R) Xeon(R) CPU E7-4860 @ 2.27GHz  6       47     2         1064.0  80          40     4        True

If you can wrap something up to the point where someone else can
run it, I'll give it a shot.

regards, tom lane



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Andres Freund
On Monday, August 08, 2011 13:19:13 Robert Haas wrote:
 On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund and...@anarazel.de wrote:
  There doesn't seem to have been any activity to include it in 3.1. The
  merge window for 3.1 just ended. The next one will open for about a
  week after the release.
  It's also not yet included in linux-next, which is a preview for the
  currently worked-on release + 1. A release takes roughly 3 months.
 
 OK.  If it doesn't get into Linux 3.2 we had better start thinking
 hard about a workaround on our side. 
If it's OK, I will write a mail to lkml referencing this thread and your numbers
inline (with attribution, obviously).
I don't think it will be that hard to convince them. But I constantly surprise
myself with naivety, so I may be wrong.


  The largest machine I can reboot often enough to test such a thing has only
  two sockets (4-core E5520). I guess you cannot reboot your loaned machine
  with a new kernel easily?
 Not really.  I do have root access to a 64-core box at the moment, and
 I could probably get permission to reboot it, but if it didn't come
 back on-line that would be awkward.
As I feared. Any chance that the person lending you the machine can give you a
hand?
Although, having read the code, I don't see how that could be, it would be
disappointing to wait for 3.2 with the llseek fixes appearing in $distribution,
just to notice fstat is still faster for $unobvious_reason...

Andres



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 12:49 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of lun ago 08 12:33:45 -0400 2011:
 On Mon, Aug 8, 2011 at 12:22 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:
  We could do that, but what the heck is the point?   What benefit are
  we trying to get by not returning a pointer to the structure?  I feel
  like we're making this ludicrously complicated with no real
  justification of why all of this complexity is adding any value.
 
  I don't see that it's complicated in any way.  Seems pretty simple to me.

 It's quite complicated.  Every time someone adds a new member to the
 struct, they need to adjust the code all over the place.  The call
 sequence of the accessor function will constantly be changing.

 You are rehashing an argument that I already rebutted: the way to solve
 this problem is with the macros, the format of which we were discussing
 in this sub-thread.  I am not sure why you bring it up again.

Well, the proposed macro design seems like it doesn't go as far in
that direction as I would have hoped.  I thought you were proposing to
have one function that returns a pointer to the struct, and then a
bunch of macros that return particular elements out of the returned
struct.  If we did that, the macro definitions wouldn't need to change
when you added new members to the struct.  On the other hand, if you
have one function with N out parameters, and the macros are just
calling that function with mostly NULL arguments, then the calls have
to be updated every time you add a new struct member.  IMHO, that's a
pain.  YMMV.

 You say the reason for not exporting the struct definition is that you
 can later change it without having to touch the callers.  That seems a
 good argument, and it seems to me to work equally for pg_amop as for the
 new header KaiGai is creating here.

In the case of pg_amop, there's an additional thing to worry about,
which is that the result is allocated in some memory context, and you
now have to worry about how long it sticks around in that context vs.
how long the caller needs it for.  Copying the data into out
parameters (or a pg_amop struct in the caller's memory context, if we
had preferred to go that route) addresses that problem.  However, here
we're just dealing with static data, so that's not a problem.

Also, if get_op_opfamily_properties() had as many out parameters as
this function is likely to need, I'd favor changing that, too.
Three is reasonable.  Ten is pushing it, and any more than that is
kind of nuts.  I think we're going to end up with quite a few in this
case, because we're rapidly accreting a set of operations that work on
any old SQL object (COMMENT, SECURITY LABEL, ALTER EXTENSION ..
ADD/DROP, ALTER FOO .. SET SCHEMA, etc.).  I think that's a good
thing, but I don't want to assume that the number of struct members is
going to remain small.

 In any case, at the start of the thread KaiGai was asking for opinions,
 and I gave mine.  I will now shut up.

Sorry to have given offense.  My goal was to understand why we were
having this argument, not to piss you off.  However, I seem to have
failed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] index sizes: single table vs partitioned

2011-08-08 Thread Andrew Hammond
For a large table, should there be a difference in index sizes between a
single-table representation and a representation based on multiple partitions
with identical indexes?

A


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund and...@anarazel.de wrote:
 If its ok I will write a mail to lkml referencing this thread and your numbers
 inline (with attribution obviously).

That would be great.  Please go ahead.

 I don't think it will be that hard to convince them. But I constantly surprise
 myself with naivety, so I may be wrong.

Heh, heh, open source is fun.

  The largest machine I can reboot often enough to test such a thing has only
  two sockets (4-core E5520). I guess you cannot reboot your loaned machine
  with a new kernel easily?
 Not really.  I do have root access to a 64-core box at the moment, and
 I could probably get permission to reboot it, but if it didn't come
 back on-line that would be awkward.
 As I feared. Any chance that the person lending you the machine can give you a
 hand?

Uh, maybe, but considering my relative inexperience in compiling the
Linux kernel, I'd be a little worried about having to iterate too many
times.

 Although, having read the code, I don't see how that could be, it would be
 disappointing to wait for 3.2 with the llseek fixes appearing in $distribution,
 just to notice fstat is still faster for $unobvious_reason...

Well, the good thing here is that we are really only concerned with
gross effects.  It's pretty obvious from the numbers I posted upthread
that the problem is related to lock contention.  If that gets fixed,
and lseek is still 20% slower under some set of circumstances, it's
not clear that we're really gonna care.  I mean, maybe it would be
nice to avoid going to the kernel at all here just so we're immune to
possible inefficiencies in other operating systems (it would be nice
if someone could repeat these tests on a big SMP box running Windows
and/or one of the BSD systems) and to save the overhead of a system call,
but those effects are pretty tiny.  We could spend a lot of time
optimizing other things before that one percolated up to the top of
the heap, at least based on what I've seen so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of lun ago 08 13:21:17 -0400 2011:
 On Mon, Aug 8, 2011 at 8:27 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 
  An array of relopt_string?  Isn't that a bit strange?  If I recall
  correctly, the point of this was to be able to allocate the
  relopt_string struct and the char array itself as a single palloc unit,
  in a single call somewhere in the reloptions API (which was convoluted
  in some points precisely to let the string case work).  I don't have the
  details of this fresh in my mind though.  It certainly worked with more
  than one string option when I committed it, IIRC.
 
 Yes, it seems strange. But it also seems that both were added by your commit
 of table-based parser:
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ba748f7a11ef884277b61d1708a17a44acfd1736

Oh, you're adding options directly to stringRelOpts?  Hmm, I can't
remember whether I tried to do that; I have a memory of testing string
options via add_string_reloption ... and on reflection, it seems obvious
that it would fail.

Perhaps the easiest way to fix it is as you suggest, by declaring the
struct to take a pointer rather than the value directly.  Not sure how
to make both cases work sanely; the add_string_reloption path will need
updates.  I don't have time to work on it right now though.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] WIP fix proposal for bug #6123

2011-08-08 Thread Florian Pflug
On Aug8, 2011, at 17:35 , Florian Pflug wrote:
 I think it would be helpful if we had a more precise idea about the
 intended use-cases. So far, the only use-case that has been described in
 detail is the one which made Kevin aware of the problem. But if I
 understood Kevin correctly, that fact that they use BEFORE and not AFTER
 triggers it more of an accident than the result of a conscious design
 decision. Though he did also mention that there might actually be advantages
 to using BEFORE instead of AFTER triggers, because that prevents other
 triggers from seeing a non-consistent state.

I pondered this some more, and came up with the following use-case of
same-row modification from a BEFORE trigger which is not easily moved
to an AFTER trigger.

Assume you have a partitioned table T with two partitions. Partition
Tcompleted contains records which are completed in some sense, while
Tcommencing contains records which are not. A record's state is indicated
by a boolean column completed, and CHECK constraints guarantee that
completed is indeed false for all records in Tcommencing and true for all
records in Tcompleted. Now, if the partitioning is supposed to be transparent
to applications, a record must automatically be moved from Tcommencing to
Tcompleted if its completed flag is changed from false to true. The
following BEFORE UPDATE trigger on Tcommencing accomplishes that.

  IF NEW.completed THEN
    DELETE FROM Tcommencing WHERE id = OLD.id;
    INSERT INTO Tcompleted VALUES (NEW.*);  -- Probably a syntax error,
                                            -- but intent is clear
    RETURN NULL;
  ELSE
    RETURN NEW;
  END IF;

Doing this from within an AFTER trigger doesn't work because the CHECK
constraint on Tcommencing would prevent the UPDATE from occurring.
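
Fleshed out, the sketch above might read as follows (untested, and assuming
id is the key column; note that the partitioning examples in the documentation
use the same VALUES (NEW.*) form, so it is likely not a syntax error after all):

CREATE FUNCTION tcommencing_move_completed() RETURNS trigger AS $$
BEGIN
    IF NEW.completed THEN
        DELETE FROM Tcommencing WHERE id = OLD.id;
        INSERT INTO Tcompleted VALUES (NEW.*);
        RETURN NULL;    -- suppress the UPDATE against Tcommencing
    ELSE
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER move_completed BEFORE UPDATE ON Tcommencing
    FOR EACH ROW EXECUTE PROCEDURE tcommencing_move_completed();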

This would also fail if we did as I suggested, and forbade any modifications
to a row for which a BEFORE trigger is in progress. I take this as evidence
that my suggestion was misguided, and that we should in fact only report an
error if the row was modified since the BEFORE trigger started *and* the
trigger returns non-NULL.

I still believe we shouldn't distinguish between UPDATEs and DELETEs, so
a DELETE should fail if the row is updated from within a BEFORE DELETE trigger
which returns non-NULL.

BTW, the case of multiple BEFORE triggers also raises some interesting
questions. Assume there are two BEFORE UPDATE triggers T1 and T2 defined on a
table T, and T1 causes a same-row UPDATE and returns non-NULL. Also assume that
T1 is coded in a way that prevents infinite recursion in that case (i.e., the
same-row UPDATE is skipped if the T1 invocation is already the result of a
same-row UPDATE). The sequence of invocations after an update of a row would be:

1) T1 (for the original UPDATE)
2) T1 (for the same-row UPDATE initiated in (1))
3) T2 (for the same-row UPDATE initiated in (1))
4) T2 (for the original UPDATE)

Now, T2 is invoked in (4) for a row which isn't visible anymore. So I'm thinking
that maybe we should check after every trigger invocation whether or not a
same-row UPDATE occurred, and immediately raise an error if it did and the
trigger didn't return NULL, rather than delay the check until all triggers have
run. It seems unlikely that delaying the check until all triggers have run adds
any significant functionality. For that to be the case, T2 would have to know
whether T1 did a same-row update, and return NULL, since otherwise we'd raise an
error anyway. The beauty of doing the check after every trigger lies in the fact
that it makes the coding rule "Thou shall return NULL if thou dared to modify
thine own row" local (i.e. per-trigger) instead of global (i.e. per
trigger-chain), and thus less likely to be violated by some strange interactions
of triggers serving distinct purposes.

To summarize, here's what I currently believe would be a sensible approach:

  After every BEFORE trigger invocation, if the trigger returned non-NULL,
  check if the latest row version is still the same as when the trigger
  started. If not, complain.

best regards,
Florian Pflug




Re: [HACKERS] psql: display of object comments

2011-08-08 Thread Robert Haas
On Fri, Aug 5, 2011 at 7:25 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Fri, Aug 5, 2011 at 8:32 AM, Robert Haas robertmh...@gmail.com wrote:
 I guess my vote is to make the SQL/MED stuff show the description only
 in verbose mode, and always at the end; and revise what we did with
 \dL to put the description at the end even in verbose mode.

 Yeah, that sounds fine to me. I've revised the residual patch to do
 so, as well as incorporated your earlier suggestion about having \dD
 and \dc only display descriptions in verbose-mode.

 I did more testing, and found and fixed some brokenness related to
 d.objsubid I had introduced into the listConversions() query, as well
 as improved the version checking in objectDescription(). Updated patch
 attached.

OK, I've now committed most of this, with some additions to the
documentation.  Remaining bits attached.

I am a bit confused as to why we have both \det and \dE.  They seem
redundant.  Shouldn't we rip one of those out?  IMHO, \det should be
the one to go, as it could be useful to do, e.g. \dtvE, which isn't
going to work with the \det syntax.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


describe_comments.v5.patch
Description: Binary data



Re: [HACKERS] psql: display of object comments

2011-08-08 Thread Josh Kupershmidt
On Mon, Aug 8, 2011 at 4:34 PM, Robert Haas robertmh...@gmail.com wrote:
 OK, I've now committed most of this, with some additions to the
 documentation.  Remaining bits attached.

Looks good, thanks.

 I am a bit confused as to why we have both \det and \dE.  They seem
 redundant.  Shouldn't we rip one of those out?  IMHO, \det should be
 the one to go, as it could be useful to do, e.g. \dtvE, which isn't
 going to work with the \det syntax.

Yeah, I was wondering that myself. At first I thought maybe someone
added in one without being aware of the other, but it looks like both
\dE and \det got added in by commit
0d692a0dc9f0e532c67c577187fe5d7d323cb95b. They are using different
queries internally (pg_class vs. pg_foreign_table), but I doubt end
users would care about that. Or perhaps the author just wanted a
command name similar to \dew and \des.

+1 for getting rid of one of them; I don't really care which one gets
the axe. Though we should make sure to be able to show all possible
columns in whichever command we keep (i.e. add an "Options" column to
\dE+ if we keep that one). BTW, I haven't bothered setting up
functioning foreign tables yet, but is "Size" always going to be 0
bytes in \dE+?

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: display of object comments

2011-08-08 Thread Josh Kupershmidt
On Mon, Aug 8, 2011 at 6:01 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 (i.e. add Options column to \dE+ if we keep that one).

Oh, never mind: "Options" is displayed by \d+ foreign_table_name.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Selecting user-defined CASTs

2011-08-08 Thread Joe Abbate
Hi,

I'm trying to query the catalogs to select only the user-defined CASTs
(my test db only has one such CAST).  Looking at pg_dump.c, I've come up
with the following so far:

SELECT castsource::regtype AS source,
   casttarget::regtype AS target,
   castfunc::regprocedure AS function,
   castcontext AS context, castmethod AS method,
   description
FROM pg_cast c
 LEFT JOIN pg_description d
  ON (c.oid = d.objoid AND d.objsubid = 0)
WHERE (castfunc != 0
   AND substring(castfunc::regproc::text for 3) != 'pg_')
ORDER BY castsource, casttarget;

This excludes the vast majority of internal casts (172 of them), where
castfunc is 0 or castfunc::regproc shows up with "pg_catalog." prepended
to the function name.  However, this still pulls in 19 other rows, as
shown in the excerpt output below (after setting search_path to
pg_catalog):

   source    |   target    |               function               | context | method |       description
-------------+-------------+--------------------------------------+---------+--------+--------------------------
 bigint      | regproc     | oid(bigint)                          | i       | f      |
 bigint      | oid         | oid(bigint)                          | i       | f      |
...
 smallint    | boolean     | public.int2_bool(smallint)           | e       | f      | Test comment for cast 1
 integer     | boolean     | bool(integer)                        | e       | f      |
...
 interval    | reltime     | reltime(interval)                    | a       | f      |
 bit varying | bit varying | varbit(bit varying,integer,boolean)  | i       | f      |
(20 rows)

The smallint AS boolean CAST is mine and is the only one I want to retrieve.

It seems the only way out is to do something like a 9-way join between
pg_cast, pg_type, pg_proc and pg_namespace to test the source, target
and function namespaces, much as dumpCast() does in pg_dump.c.  Before I
go that route, I thought I'd check with -hackers to see if there's a
simpler way.

Regards,

Joe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selecting user-defined CASTs

2011-08-08 Thread Joe Abbate
On 08/08/2011 06:31 PM, Joe Abbate wrote:
 It seems the only way out is to do something like a 9-way join between
 pg_cast, pg_type, pg_proc and pg_namespace to test the source, target
 and function namespaces much as dumpCast() does in pg_dump.c.  Before I
 go that route, I'd thought I'd check with -hackers to see if there's a
 simpler way.

Well, for my specific example (both source and target are pg_catalog
types and the function is in the public schema), the following query
does the trick:

SELECT castsource::regtype AS source,
   casttarget::regtype AS target,
   castfunc::regprocedure AS function,
   castcontext AS context, castmethod AS method,
   description
FROM pg_cast c
 JOIN pg_type s ON (castsource = s.oid)
  JOIN pg_namespace sn ON (s.typnamespace = sn.oid)
 JOIN pg_type t ON (casttarget = t.oid)
  JOIN pg_namespace tn ON (t.typnamespace = tn.oid)
 LEFT JOIN pg_proc p ON (castfunc = p.oid)
  LEFT JOIN pg_namespace pn ON (p.pronamespace = pn.oid)
 LEFT JOIN pg_description d
  ON (c.oid = d.objoid AND d.objsubid = 0)
WHERE (substring(sn.nspname for 3) = 'pg_'
   AND substring(tn.nspname for 3) = 'pg_'
   AND castfunc != 0 AND substring(pn.nspname for 3) != 'pg_')
ORDER BY castsource, casttarget;

I realize that for the general case, the WHERE clause has to be expanded
(and may look much, much uglier). Nevertheless, if somebody has some
simplifications, I'd be glad to hear them.
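
One possible generalization (a sketch only; the schema list is an
assumption, not something taken from pg_dump.c): treat a cast as
user-defined when its source type, target type, or cast function lives
outside the built-in schemas.

SELECT c.castsource::regtype AS source,
       c.casttarget::regtype AS target,
       c.castfunc::regprocedure AS function,
       c.castcontext AS context, c.castmethod AS method,
       d.description
FROM pg_cast c
 JOIN pg_type s ON (c.castsource = s.oid)
  JOIN pg_namespace sn ON (s.typnamespace = sn.oid)
 JOIN pg_type t ON (c.casttarget = t.oid)
  JOIN pg_namespace tn ON (t.typnamespace = tn.oid)
 LEFT JOIN pg_proc p ON (c.castfunc = p.oid)
  LEFT JOIN pg_namespace pn ON (p.pronamespace = pn.oid)
 LEFT JOIN pg_description d ON (c.oid = d.objoid AND d.objsubid = 0)
WHERE sn.nspname NOT IN ('pg_catalog', 'information_schema')
   OR tn.nspname NOT IN ('pg_catalog', 'information_schema')
   OR (c.castfunc <> 0
       AND pn.nspname NOT IN ('pg_catalog', 'information_schema'))
ORDER BY c.castsource, c.casttarget;

This still can't recognize a user-created binary-coercible cast
(castfunc = 0) between two built-in types, since such a cast carries no
schema information at all.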

Joe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selecting user-defined CASTs

2011-08-08 Thread Tom Lane
Joe Abbate j...@freedomcircle.com writes:
 I'm trying to query the catalogs to select only the user-defined CASTs

This is rather difficult to do, actually, because pg_cast stores
neither an owner nor a schema for casts, which eliminates all of the
principled ways in which you might decide that a cast belongs to the
system or the user.

An easy but unprincipled way to do it is

select ... from pg_cast c where c.oid >= 16384;

What that really does is eliminate the casts that were installed during
initdb, which are at least a subset of the system ones, and might be
all of them depending on what you feel a "system" cast is.  The main
shortcoming of it is that there's no very good way to eliminate casts
installed by extensions, should you want to not consider those "user"
casts.

Another approach is to check pg_depend.  A cast installed by initdb will
match a pin entry in pg_depend (refclassid = pg_cast, refobjid =
cast's OID, deptype = 'p').  You're still out of luck for distinguishing
extension members in existing releases, but in 9.1 and up it'll be
possible to identify casts belonging to extensions from pg_depend
entries.
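
To make that concrete, here is a sketch of the pin-entry filter (the
output column list is arbitrary; on 9.1, extension-owned casts could be
screened out the same way via classid/objid entries with deptype = 'e'):

SELECT c.castsource::regtype AS source,
       c.casttarget::regtype AS target,
       c.castcontext AS context, c.castmethod AS method
FROM pg_cast c
WHERE NOT EXISTS (SELECT 1
                  FROM pg_depend d
                  WHERE d.refclassid = 'pg_cast'::regclass
                    AND d.refobjid = c.oid
                    AND d.deptype = 'p')
ORDER BY c.castsource, c.casttarget;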

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers