Re: [HACKERS] Removing special case OID generation

2012-02-11 Thread Simon Riggs
On Fri, Feb 10, 2012 at 10:38 PM, Jim Nasby j...@nasby.net wrote:
 On 2/7/12 8:14 AM, Alvaro Herrera wrote:

 Having one sequence for each toast table could be wasteful though.  I
 mean, sequences are not the best use of shared buffer cache currently.
 If we could have more than one sequence data in a shared buffer page,
 things would be different.  Not sure how serious this really is.


 This would actually be an argument for supporting multiple page sizes... too
 bad that's such a beast.

 FWIW, from our most complex production database:

 cnuapp_p...@postgres08.obr=# select relkind, count(*) from pg_class group by
 1;
  relkind | count
---------+-------
  S       |   522
  r       |  1058
  t       |   698
  i       |  2894
  v       |   221
  c       |    12
 (6 rows)


Yeah, I was thinking we would do well to implement cached sequences for,
say, the first 1000 sequences.

That would mean we'd only use a few thousand bytes of memory rather
than 4 MB for your sequences.
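
For the record, the 4 MB figure is presumably just one 8 kB shared-buffer
page per sequence, i.e. for Jim's 522 sequences:

\[
522 \times 8\,\text{kB} \approx 4.2\,\text{MB}
\qquad\text{vs.}\qquad
522 \times \text{a few bytes of cached state} \approx \text{a few kB}.
\]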

Idea would be to make Sequences as fast as OIDs and get rid of the
weird OID code.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: Making TRUNCATE more MVCC-safe

2012-02-11 Thread Dan Ports
On Fri, Feb 10, 2012 at 01:59:18PM -0500, Robert Haas wrote:
 I guess I'm not particularly excited by the idea of trying to make
 TRUNCATE MVCC-safe.  I notice that the example involves the REPEATABLE
 READ isolation level, which is already known to be busted in a variety
 of ways; that's why we now have SERIALIZABLE, and why most people use
 READ COMMITTED.  Are there examples of this behavior at other
 isolation levels?

Marti's example works for SERIALIZABLE isolation too. In general, when
DDL operations weren't previously MVCC-safe under REPEATABLE READ, we
didn't change that in SERIALIZABLE.

There's some SSI code for TRUNCATE TABLE that tries to do something
reasonable, and it catches some (more subtle) anomalies involving
concurrent truncates -- but it can only do so much when TRUNCATE itself
isn't MVCC-safe. I expect that the combination of that code and this
patch would ensure full serializability for TRUNCATE operations.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new keywords SNAPSHOT and TYPES to the keyword list in gram.

2012-02-11 Thread Alex Hunsaker
On Thu, Feb 9, 2012 at 11:30, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of jue feb 09 12:17:59 -0300 2012:

 FWIW that script is throwing a warning here:
 Use of assignment to $[ is deprecated at 
 /pgsql/source/HEAD/src/tools/check_keywords.pl line 19.

 Any Perl hackers want to improve that script?

The attached 2 liner does it for me.
*** a/src/tools/check_keywords.pl
--- b/src/tools/check_keywords.pl
***
*** 16,22  if (@ARGV) {
  	$path = ".";
  }
  
- $[ = 1;			# set array base to 1
  $, = ' ';		# set output field separator
  $\ = "\n";		# set output record separator
  
--- 16,21 
***
*** 60,66  line: while (<GRAM>) {
  $n = (@arr = split(' ', $S));
  
  # Ok, we're in a keyword list. Go through each field in turn
! for (my $fieldIndexer = 1; $fieldIndexer <= $n; $fieldIndexer++) {
  	if ($arr[$fieldIndexer] eq '*/' && $comment) {
  	$comment = 0;
  	next;
--- 59,65 
  $n = (@arr = split(' ', $S));
  
  # Ok, we're in a keyword list. Go through each field in turn
! for (my $fieldIndexer = 0; $fieldIndexer < $n; $fieldIndexer++) {
  	if ($arr[$fieldIndexer] eq '*/' && $comment) {
  	$comment = 0;
  	next;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only quals vs. security_barrier views

2012-02-11 Thread Jesper Krogh

On 2012-02-09 22:17, Jesper Krogh wrote:

On 2012-02-09 21:09, Robert Haas wrote:

That doesn't make sense to me.  If you probe index A for rows where a
= 1 and find that CTID (100,1) is such a row, and now want to return a
column value b that is not present in that index, the fastest way to
get the row is going to be to fetch block 100 from the heap and return
the data out of the first tuple.  To get the value out of some other
index that does include column b would require scanning the entire
index looking for that CTID, just so you could then grab the
corresponding index tuple, which wouldn't make any sense at all.


You're right; in my head, everything is wired up against my primary
keys, and of course that isn't the case for the DB. Sorry for the noise.


Ok, but there are still cases where we don't even need to construct
a data tuple at all:

2012-02-11 13:14:01.579 jk=# explain select count(*) from testtable 
where fts @@ to_tsquery('english','test1');

                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=31.24..31.25 rows=1 width=0)
   ->  Bitmap Heap Scan on testtable  (cost=16.03..31.23 rows=4 width=0)
         Recheck Cond: (fts @@ '''test1'''::tsquery)
         ->  Bitmap Index Scan on ftsid  (cost=0.00..16.03 rows=4 width=0)
               Index Cond: (fts @@ '''test1'''::tsquery)
(5 rows)


Another idea sprang into my head: indices on (ctid, some mix of
columns) could actually serve as some kind of vertical partitioning
of the table.

Whether it actually will be more efficient or not needs to be tested.

Jesper

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] When do we lose column names?

2012-02-11 Thread Andrew Dunstan

On 02/09/2012 03:06 PM, Andrew Dunstan wrote:



On 02/09/2012 02:54 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

OK, the one place that needs to be fixed to avoid the crash caused by
the json regression tests with the original patch is in
 src/backend/parser/parse_expr.c:transformRowExpr().
Other candidates I have found that don't set colnames and should
probably use dummy names are:
   * src/backend/parser/gram.y (row: production)
   * 
src/backend/optimizer/prep/prepunion.c:adjust_appendrel_attrs_mutator()

   * src/backend/optimizer/util/var.c:flatten_join_alias_vars_mutator()

Hm, can't the last two get real column names from somewhere?


Possibly. I'll dig a bit deeper.




I've had a look at these two. It's at least not obvious to me how to do 
this simply, if at all. In the last case it looks like we'd need to 
process the object recursively just like we do to extract the field 
values, and I don't know where to get them in the appendrel case at all, 
possibly because I'm not very familiar with this code.


Do we actually need to bother with these cases? The regression tests 
pass without touching them, either because they don't matter or because 
we don't have a test for these cases that would tickle the assertion 
that was failing. If they don't matter, would it not be better just to 
note that in the code rather than building a list of field names for no 
good purpose?


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] auto_explain produces invalid JSON

2012-02-11 Thread Andrew Dunstan



On 02/10/2012 01:14 PM, Peter Eisentraut wrote:

The auto_explain module appears to create invalid JSON (in all versions
since 9.0).  For example, with the settings

auto_explain.log_format = 'json'
auto_explain.log_min_duration = 0

the query

select * from pg_type;

produces this in the log:

LOG:  duration: 529.808 ms  plan:
	[
	  "Query Text": "select * from pg_type;",
	  "Plan": {
	    "Node Type": "Seq Scan",
	    "Relation Name": "pg_type",
	    "Alias": "pg_type",
	    "Startup Cost": 0.00,
	    "Total Cost": 9.87,
	    "Plan Rows": 287,
	    "Plan Width": 611
	  }
	]

Note that at the top level, it uses the array delimiters [ ] for what is
actually an object (key/value).  Running this through a JSON parser will
fail.

By contrast, EXPLAIN (FORMAT JSON) on the command line produces:

              QUERY PLAN
--------------------------------------
 [
   {
     "Plan": {
       "Node Type": "Seq Scan",
       "Relation Name": "pg_type",
       "Alias": "pg_type",
       "Startup Cost": 0.00,
       "Total Cost": 9.87,
       "Plan Rows": 287,
       "Plan Width": 611
     }
   }
 ]
(1 row)





Yeah, looks like this dates back to when we first got JSON output.

Auto-explain does this:

 ExplainBeginOutput(&es);
 ExplainQueryText(&es, queryDesc);
 ExplainPrintPlan(&es, queryDesc);
 ExplainEndOutput(&es);


But ExplainBeginOutput says:

 case EXPLAIN_FORMAT_JSON:
  /* top-level structure is an array of plans */
  appendStringInfoChar(es->str, '[');


Now that's not true in the auto-explain case, which prints one query + 
one plan.


Since this is an exposed API, I don't think we can just change it. We 
probably need a new API that does the right thing for beginning and 
ending auto_explain output. (ExplainBeginLabeledOutput?)


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label

2012-02-11 Thread Kevin Grittner
Kevin Grittner  wrote:
Tom Lane wrote:
 
 I agree it's a bug that you can do what Kevin's example shows.
 
 I'll look at it and see if I can pull together a patch.
 
Attached.
 
Basically, if a GUC has a check function, this patch causes it to be
run on a RESET just like it is on a SET, to make sure that the
resulting value is valid to set within the context.  Some messages
needed adjustment.  While I was there, I made the code a little more
consistent among related GUCs.
 
I also added a little to the regression tests to cover this.
 
-Kevin




check-reset-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory usage during sorting

2012-02-11 Thread Jeff Janes
On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The attached patch allows it to reuse that memory.  On my meager
 system it reduced the building of an index on an integer column in a
 skinny 200 million row totally randomly ordered table by about 3% from
 a baseline of 25 minutes.


 Just to give a standard review, this patch is one line change and
 applies cleanly, builds ok.

 I'm not pretty sure what exactly you're trying to accomplish, but it
 seems to me that it's avoiding the first dumptuples cycle by forcing
 availMem = 0 even if it's negative.

Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
tape-sort, it starts by calling WRITETUP a large number of times
consecutively, to work off the memory deficit incurred by the 3 blocks
per tape of tape overhead, and then after that calls WRITETUP about
once per puttuple.  Under my patch, it would only call WRITETUP
about once per puttuple, right from the beginning.
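
Roughly, the change is of this shape (just a sketch to show the idea,
not the attached one-liner; the exact spot, right where inittapes()
charges the per-tape buffer overhead, is my assumption):

    /* inittapes(): tape buffers get charged against the sort's memory */
    tapeSpace = maxTapes * TAPE_BUFFER_OVERHEAD;
    USEMEM(state, tapeSpace);       /* can push availMem below zero */

    /*
     * Sketch: forget the deficit instead of making the first
     * dumptuples() cycle write a long burst of tuples to pay it back.
     */
    if (state->availMem < 0)
        state->availMem = 0;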

 I read your comments as it'd be
 avoiding to alternate reading/writing back and force with scattered
 memory failing memory cache much during merge phase, but actually it
 doesn't affect merge phase but only init-dump phase, does it?

It affects the building of the runs.  But this building of the runs is
not a simple dump, it is itself a mini merge phase, in which it merges
the existing in-memory priority queue against the still-incoming
tuples from the node which invoked the sort.  By using less memory
than it could, this means that the resulting runs are smaller than
they could be, and so will sometimes necessitate an additional layer
of merging later on.   (This effect is particularly large for the very
first run being built.  Generally by merging incoming tuples into the
memory-tuples, you can create runs that are about 1.7 times the size of
what fits in memory.  By wasting some memory, we are getting 1.7 times
the size of a smaller starting point.  But for the first run, it is
worse than that.
 Most of the benefit that leads to that 1.7 multiplier comes at the
very early stage of each run-build.  But by initially using the full
memory, then writing out a bunch of tuples without doing any merge of
the incoming, we have truncated the part that gives the most benefit.)

My analysis that the freed memory is never reused (because we refuse
to reuse it ourselves and it is too fragmented to be reused by anyone
else, like the palloc or VM system) only applies to the run-building
phase.  So "never" was a bit of an overstatement.  By the time the last
initial run is completely written out to tape, the heap used for the
priority queue should be totally empty.  So at this point the
allocator would have the chance to congeal all of the fragmented
memory back into larger chunks, or maybe it parcels the allocations
back out again in an order so that the unused space is contiguous and
could be meaningfully paged out.

But, is it worth worrying about how much we fragment memory and if
we overshoot our promises by 10 or 20%?

 If so,
 I'm not so convinced your benchmark gave 3 % gain by this change.
 Correct me as I'm probably wrong.

I've now done more complete testing.  Building an index on a
200,000,000 row table with an integer column populated in random order
with integers from 1..500,000,000, non-unique, on a machine with 2GB
of RAM and 600MB of shared_buffers.

It improves things by 6-7 percent at the low end of the memory sizes;
the rest are in the noise, except at 77936 kB, where it reproducibly
makes things 4% worse, for reasons I haven't figured out.  So maybe the
best thing to do is, rather than micromanaging memory usage, simply
don't set maintenance_work_mem way too low.  (But the low value is the
default.)

maintenance_work_mem (kB)    % change with patch
 16384     6.2
 19484     7.8
 23170    -0.1
 27554     0.1
 32768     1.9
 38968    -1.9
 46341     0.4
 55109     0.4
 65536     0.6
 77936    -4.3
 92682    -0.3
110218     0.1
131072     1.7


 Anyway, it's nice to modify the comment just above the change, since
 it's now true with it.

Much of that comment is referring to other things (some of which I
don't understand).  How about an additional comment just before my
code to the gist of:

/* If we have already used, and thus fragmented, the memory then we
 * might as well continue to make use of it as no one else can.
 */

(That might not actually be true, if the tuples we are sorting are
very large, or perhaps if they arrive in already reverse sorted order)

Thanks,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpgsql leaking memory when stringifying datums

2012-02-11 Thread Tom Lane
I wrote:
 BTW, it occurs to me to wonder whether we need to worry about such
 subsidiary leaks in type input functions as well.

Sure enough, once you find an input function that leaks memory, there's
trouble:

create type myrow as (f1 text, f2 text, f3 text);

create or replace function leak_assign() returns void as $$
declare
t myrow[];
i int;
begin
for i in 1..1000 loop
  t := '{"(abcd,efg' || ',hij)", "(a,b,c)"}';
end loop;
end;
$$ language plpgsql;

So the attached third try also moves the input function calls in
exec_cast_value into the short-lived context, and rejiggers callers as
necessary to deal with that.  This actually ends up simpler and probably
faster than the original coding, because we are able to get rid of some
ad-hoc data copying and pfree'ing, and most of the performance-critical
code paths already had exec_eval_cleanup calls anyway.

Also, you wrote:
 With that I was still left with a leak in the typecast() test function
 and it turns out that sticking a exec_eval_cleanup into exec_move_row
 fixed it. The regression tests pass, but I'm not 100% sure if it's
 actually safe.

After some study I felt pretty nervous about that too.  It's safe enough
with the statement-level callers of exec_move_row, but there are several
calls from exec_assign_value, whose API contract says specifically that
it *won't* call exec_eval_cleanup.  Even if it works today, that's a bug
waiting to happen.  So I took the exec_eval_cleanup back out of
exec_move_row, and instead made all the statement-level callers do it.

I think this version is ready to go, so barring objections I'll set to
work on back-patching it.

regards, tom lane


diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index bf952b62478792b5564fe2af744a318322ea197c..0b126a3f4ac0e16e4ae19b8507c5c1352842a7e2 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** static void exec_move_row(PLpgSQL_execst
*** 188,201 
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  	PLpgSQL_row *row,
  	TupleDesc tupdesc);
! static char *convert_value_to_string(Datum value, Oid valtype);
! static Datum exec_cast_value(Datum value, Oid valtype,
  Oid reqtype,
  FmgrInfo *reqinput,
  Oid reqtypioparam,
  int32 reqtypmod,
  bool isnull);
! static Datum exec_simple_cast_value(Datum value, Oid valtype,
  	   Oid reqtype, int32 reqtypmod,
  	   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
--- 188,204 
  static HeapTuple make_tuple_from_row(PLpgSQL_execstate *estate,
  	PLpgSQL_row *row,
  	TupleDesc tupdesc);
! static char *convert_value_to_string(PLpgSQL_execstate *estate,
! 		Datum value, Oid valtype);
! static Datum exec_cast_value(PLpgSQL_execstate *estate,
! Datum value, Oid valtype,
  Oid reqtype,
  FmgrInfo *reqinput,
  Oid reqtypioparam,
  int32 reqtypmod,
  bool isnull);
! static Datum exec_simple_cast_value(PLpgSQL_execstate *estate,
! 	   Datum value, Oid valtype,
  	   Oid reqtype, int32 reqtypmod,
  	   bool isnull);
  static void exec_init_tuple_store(PLpgSQL_execstate *estate);
*** plpgsql_exec_function(PLpgSQL_function *
*** 295,300 
--- 298,305 
  		/* If arg is null, treat it as an empty row */
  		exec_move_row(&estate, NULL, row, NULL, NULL);
  	}
+ 	/* clean up after exec_move_row() */
+ 	exec_eval_cleanup(&estate);
  }
  break;
  
*** plpgsql_exec_function(PLpgSQL_function *
*** 430,436 
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(estate.retval, estate.rettype,
  			func->fn_rettype,
  			&(func->fn_retinput),
  			func->fn_rettypioparam,
--- 435,443 
  		else
  		{
  			/* Cast value to proper type */
! 			estate.retval = exec_cast_value(&estate,
! 			estate.retval,
! 			estate.rettype,
  			func->fn_rettype,
  			&(func->fn_retinput),
  			func->fn_rettypioparam,
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1757,1763 
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
--- 1764,1770 
  	 * Get the value of the lower bound
  	 */
  	value = exec_eval_expr(estate, stmt->lower, &isnull, &valtype);
! 	value = exec_cast_value(estate, value, valtype, var->datatype->typoid,
  			&(var->datatype->typinput),
  			var->datatype->typioparam,
  			var->datatype->atttypmod, isnull);
*** exec_stmt_fori(PLpgSQL_execstate *estate
*** 1772,1778 
  	 * Get the value of the upper bound
  	 */
  	value = exec_eval_expr(estate, stmt->upper, &isnull, &valtype);
! 	value = exec_cast_value(value, 

Re: [HACKERS] auto_explain produces invalid JSON

2012-02-11 Thread Andrew Dunstan



On 02/11/2012 01:18 PM, Andrew Dunstan wrote:



On 02/10/2012 01:14 PM, Peter Eisentraut wrote:

 [ auto-explain JSON output should be an object instead of an array ]




Yeah, looks like this dates back to when we first got JSON output.

Auto-explain does this:

 ExplainBeginOutput(&es);
 ExplainQueryText(&es, queryDesc);
 ExplainPrintPlan(&es, queryDesc);
 ExplainEndOutput(&es);


But ExplainBeginOutput says:

 case EXPLAIN_FORMAT_JSON:
  /* top-level structure is an array of plans */
  appendStringInfoChar(es->str, '[');


Now that's not true in the auto-explain case, which prints one query + 
one plan.


Since this is an exposed API, I don't think we can just change it. We 
probably need a new API that does the right thing for beginning and 
ending auto_explain output. (ExplainBeginLabeledOutput?)






PFA a patch along these lines, which seems to do the Right Thing (tm)

cheers

andrew


*** a/contrib/auto_explain/auto_explain.c
--- b/contrib/auto_explain/auto_explain.c
***
*** 290,299  explain_ExecutorEnd(QueryDesc *queryDesc)
  			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
! 			ExplainBeginOutput(&es);
  			ExplainQueryText(&es, queryDesc);
  			ExplainPrintPlan(&es, queryDesc);
! 			ExplainEndOutput(&es);
  
  			/* Remove last line break */
  			if (es.str->len > 0 && es.str->data[es.str->len - 1] == '\n')
--- 290,299 ----
  			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
! 			ExplainBeginLabeledOutput(&es);
  			ExplainQueryText(&es, queryDesc);
  			ExplainPrintPlan(&es, queryDesc);
! 			ExplainEndLabeledOutput(&es);
  
  			/* Remove last line break */
  			if (es.str->len > 0 && es.str->data[es.str->len - 1] == '\n')
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 2258,2263  ExplainEndOutput(ExplainState *es)
--- 2258,2302 
  }
  
  /*
+  * Emit the start-of-output boilerplate, if it's labeled, as auto_explain
+  * JSON is. For non-json cases, fall back on ExplainBeginOutput.
+  *
+  */
+ void
+ ExplainBeginLabeledOutput(ExplainState *es)
+ {
+ 	switch (es->format)
+ 	{
+ 		case EXPLAIN_FORMAT_JSON:
+ 			/* top level is an object of labeled items */
+ 			appendStringInfoChar(es->str, '{');
+ 			es->grouping_stack = lcons_int(0, es->grouping_stack);
+ 			es->indent++;
+ 			break;
+ 		default:
+ 			ExplainBeginOutput(es);
+ 	}
+ }
+ 
+ /* 
+  * Companion to ExplainBeginLabeledOutput
+  */
+ void
+ ExplainEndLabeledOutput(ExplainState *es)
+ {
+ 	switch (es->format)
+ 	{
+ 		case EXPLAIN_FORMAT_JSON:
+ 			es->indent--;
+ 			appendStringInfoString(es->str, "\n}");
+ 			es->grouping_stack = list_delete_first(es->grouping_stack);
+ 			break;
+ 		default:
+ 			ExplainEndOutput(es);
+ 	}
+ }
+ 
+ /*
   * Put an appropriate separator between multiple plans
   */
  void
*** a/src/include/commands/explain.h
--- b/src/include/commands/explain.h
***
*** 71,76  extern void ExplainQueryText(ExplainState *es, QueryDesc *queryDesc);
--- 71,78 
  
  extern void ExplainBeginOutput(ExplainState *es);
  extern void ExplainEndOutput(ExplainState *es);
+ extern void ExplainBeginLabeledOutput(ExplainState *es);
+ extern void ExplainEndLabeledOutput(ExplainState *es);
  extern void ExplainSeparatePlans(ExplainState *es);
  
  extern void ExplainPropertyList(const char *qlabel, List *data,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] auto_explain produces invalid JSON

2012-02-11 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 But ExplainBeginOutput says:

   case EXPLAIN_FORMAT_JSON:
/* top-level structure is an array of plans */
 appendStringInfoChar(es->str, '[');

 Now that's not true in the auto-explain case, which prints one query + 
 one plan.

What about queries that expand to multiple plans because of rules?

 Since this is an exposed API, I don't think we can just change it. We 
 probably need a new API that does the right thing for beginning and 
 ending auto_explain output. (ExplainBeginLabeledOutput?)

I'm inclined to think that this is auto_explain's error, not that of
the core code, ie we should be changing the output.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] auto_explain produces invalid JSON

2012-02-11 Thread Andrew Dunstan



On 02/11/2012 03:22 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

But ExplainBeginOutput says:
   case EXPLAIN_FORMAT_JSON:
/* top-level structure is an array of plans */
 appendStringInfoChar(es->str, '[');
Now that's not true in the auto-explain case, which prints one query +
one plan.

What about queries that expand to multiple plans because of rules?


Oh, hmm, good point.





Since this is an exposed API, I don't think we can just change it. We
probably need a new API that does the right thing for beginning and
ending auto_explain output. (ExplainBeginLabeledOutput?)

I'm inclined to think that this is auto_explain's error, not that of
the core code, ie we should be changing the output.





It looks like it will be messy either way. ;-(

cheers

andrew




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] auto_explain produces invalid JSON

2012-02-11 Thread Andrew Dunstan



On 02/11/2012 03:22 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

But ExplainBeginOutput says:
   case EXPLAIN_FORMAT_JSON:
/* top-level structure is an array of plans */
 appendStringInfoChar(es->str, '[');
Now that's not true in the auto-explain case, which prints one query +
one plan.

What about queries that expand to multiple plans because of rules?



... and the answer is it logs them in separate pieces of JSON.




Since this is an exposed API, I don't think we can just change it. We
probably need a new API that does the right thing for beginning and
ending auto_explain output. (ExplainBeginLabeledOutput?)

I'm inclined to think that this is auto_explain's error, not that of
the core code, ie we should be changing the output.





Well, maybe this is more to your taste, although it strikes me as more 
than something of a kludge. At least it's short :-)


cheers

andrew

*** a/contrib/auto_explain/auto_explain.c
--- b/contrib/auto_explain/auto_explain.c
***
*** 299,304  explain_ExecutorEnd(QueryDesc *queryDesc)
--- 299,311 
  			if (es.str->len > 0 && es.str->data[es.str->len - 1] == '\n')
  				es.str->data[--es.str->len] = '\0';
  
+ 			/* Fix JSON to output an object */
+ 			if (auto_explain_log_format == EXPLAIN_FORMAT_JSON)
+ 			{
+ 				es.str->data[0] = '{';
+ 				es.str->data[es.str->len - 1] = '}';
+ 			}
+ 
  			/*
  			 * Note: we rely on the existing logging of context or
  			 * debug_query_string to identify just which statement is being

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pl/python long-lived allocations in datum-dict transformation

2012-02-11 Thread Tom Lane
Jan Urbański wulc...@wulczer.org writes:
 This is annoying for functions that plough through large tables, doing
 some calculation. Attached is a patch that does the conversion of
 PostgreSQL Datums into Python dict objects in a scratch memory context
 that gets reset every time.

As best I can tell, this patch proposes creating a new, separate context
(chewing up 8KB+) for every plpython procedure that's ever used in a
given session.  This cure could easily be worse than the disease as far
as total space consumption is concerned.  What's more, it's unclear that
it won't malfunction altogether if the function is used recursively
(ie, what if PLyDict_FromTuple ends up calling the same function again?)
Can't you fix it so that the temp context is associated with a
particular function execution, rather than being statically allocated
per-function?
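
Concretely, something of this shape is what I have in mind (a sketch
only; the names are illustrative, not the actual plpython code):

    /* scratch context tied to the current execution, not the procedure */
    MemoryContext scratch;
    MemoryContext oldcxt;

    scratch = AllocSetContextCreate(CurrentMemoryContext,
                                    "PL/Python tuple-to-dict scratch",
                                    ALLOCSET_DEFAULT_MINSIZE,
                                    ALLOCSET_DEFAULT_INITSIZE,
                                    ALLOCSET_DEFAULT_MAXSIZE);

    oldcxt = MemoryContextSwitchTo(scratch);
    /* ... build the Python dict from the tuple here ... */
    MemoryContextSwitchTo(oldcxt);

    /* goes away with this execution, so recursive calls get their own */
    MemoryContextDelete(scratch);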

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] random_page_cost vs seq_page_cost

2012-02-11 Thread Jeff Janes
On Tue, Feb 7, 2012 at 2:06 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 02/07/2012 03:23 PM, Bruce Momjian wrote:

 Where did you see that there will be an improvement in the 9.2
 documentation?  I don't see an improvement.


 I commented that I'm hoping for an improvement in the documentation of how
 much timing overhead impacts attempts to measure this area better.  That's
 from the add timing of buffer I/O requests feature submission.  I'm not
 sure if Bene read too much into that or not; I didn't mean to imply that the
 docs around random_page_cost have gotten better.

 This particular complaint is extremely common though, seems to pop up on one
 of the lists a few times each year.  Your suggested doc fix is fine as a
 quick one, but I think it might be worth expanding further on this topic.
  Something discussing SSDs seems due here too.  Here's a first draft of a
 longer discussion, to be inserted just after where it states the default
 value is 4.0:

 True random access to mechanical disk storage will normally be more
 expensive than this default suggests.  The value used is lower to reflect
 caching effects.  Some common random accesses to disk, such as indexed
 reads, are considered likely to be in cache.  The default value can be
 thought of as modeling random access as 40 times as expensive as sequential,
 while expecting that 90% of random reads will actually be cached.

For these numbers to work out to 4, we must also be assuming that
virtually zero of the sequentially read pages are cached.  Is that a
realistic assumption?  If the table is accessed only by seq scans, the
ring buffer may prevent it from getting cached (although even then it
could very well be the OS cache as that doesn't respect the ring
buffer), but it would be pretty common for other parts of the
application to access the same table via index scan, and so cause
substantial parts of it to be cached.
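
Spelling out the arithmetic implied by that model (treating a cached
read as free and writing c_seq for the cached fraction of sequential
reads):

\[
\frac{\text{effective random cost}}{\text{effective sequential cost}}
  = \frac{(1 - 0.9) \times 40}{(1 - c_{\text{seq}}) \times 1}
  = \frac{4}{1 - c_{\text{seq}}},
\]

which only comes out to the default of 4 when c_seq is essentially zero.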

But I can see that that would rapidly get too complicated to discuss
in the documentation.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] random_page_cost vs seq_page_cost

2012-02-11 Thread Jeff Janes
On Tue, Feb 7, 2012 at 4:58 PM, Bruce Momjian br...@momjian.us wrote:
 I was initially concerned that tuning advice in this part of the docs
 would look out of place, but now see the 25% shared_buffers
 recommendation, and it looks fine, so we are OK.  (Should we caution
 against more than 8GB of shared buffers?  I don't see that in the docs.)

Has it ever been well-characterized what the problem is with 8GB?
I've used shared buffers above that size for testing purposes and
could never provoke a problem with it.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-02-11 Thread Jeff Janes
On Mon, Feb 6, 2012 at 6:38 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Feb 4, 2012 at 2:13 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 We really need to nail that down.  Could you post the scripts (on the
 wiki) you use for running the benchmark and making the graph?  I'd
 like to see how much work it would be for me to change it to detect
 checkpoints and do something like color the markers blue during
 checkpoints and red elsewhen.

 They're pretty crude - I've attached them here.

Thanks.  But given Greg's comments about pgbench-tools, I'll probably
work on learning that instead.


 Also, I'm not sure how bad that graph really is.  The overall
 throughput is more variable, and there are a few latency spikes but
 they are few.  The dominant feature is simply that the long-term
 average is less than the initial burst.  Of course the goal is to have
 a high level of throughput with a smooth latency under sustained
 conditions.  But to expect that that long-sustained smooth level of
 throughput be identical to the initial burst throughput sounds like
 more of a fantasy than a goal.

 That's probably true, but the drop-off is currently quite extreme.
 The fact that disabling full_page_writes causes throughput to increase
 by 4x is dismaying, at least to me.

I just meant that the latest graph, with full page writes off, is not that bad.

The ones with fpw on are definitely bad, more due to the latency
spikes than the throughput.



 If we want to accept the lowered
 throughput and work on the what variability/spikes are there, I think
 a good approach would be to take the long term TPS average, and dial
 the number of clients back until the initial burst TPS matches that
 long term average.  Then see if the spikes still exist over the long
 term using that dialed back number of clients.

 Hmm, I might be able to do that.

 I don't think the full-page-writes are leading to WALInsert
 contention, for example, because that would probably lead to smooth
 throughput decline, but not those latency spikes in which those entire
 seconds go by without transactions.

 Right.

 I doubt it is leading to general
 IO compaction, as the IO at that point should be pretty much
 sequential (the checkpoint has not yet reached the sync stage, the WAL
 is sequential).  So I bet that that is caused by fsyncs occurring at
 xlog segment switches, and the locking that that entails.

 That's definitely possible.

 If I
 recall, we can have a segment which is completely written to OS and in
 the process of being fsynced, and we can have another segment which is
 in some state of partially in wal_buffers and partly written out to OS
 cache, but that we can't start reusing the wal_buffers that were
 already written to OS for that segment (and therefore are
 theoretically available for reuse by the upcoming 3rd segment)  until
 the previous segments fsync has completed.  So all WALInsert's freeze.
  Or something like that.  This code has changed a bit since last time
 I studied it.

 Yeah, I need to better-characterize where the pauses are coming from,
 but I'm reluctant to invest too much effort in until Heikki's xlog
 scaling patch goes in, because I think that's going to change things
 enough that any work done now will mostly be wasted.

OK, You've scared me off from digging into the locking at wal switch
for now.  So instead I've spent some time today trying to unbreak the
xlog scaling patch and haven't had any luck.  Does anyone know if any
combination of that patches history + git master history had been
tested and verified to produce a recoverable WAL stream?  It is a
shame that make check doesn't test for that, but I don't know how to
make it do so.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-11 Thread Gaetano Mendola

On 19/09/2011 21:41, PostgreSQL - Hans-Jürgen Schönig wrote:


On Sep 19, 2011, at 5:16 PM, Tom Lane wrote:


Greg Stark st...@mit.edu writes:

That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well.


I think the real problem would be that we are seldom sorting just the
key values.  If you have to push the tuples through the GPU too, your
savings are going to go up in smoke pretty quickly …




i would argue along a similar line.
to make GPU code fast it has to be pretty much tailored to do exactly one thing
- otherwise you have no chance to get anywhere close to card bandwidth.
if you look at two similar GPU codes which seem to do the same thing you
might easily see that one is 10 times faster than the other - for bloody reasons such as
memory alignment, memory transaction size or whatever.
this opens a bit of a problem: PostgreSQL sorting is so generic and so flexible
that i would be really surprised if somebody could come up with a solution
which really comes close to what the GPU can do.
it would definitely be interesting to see a prototype, however.


The Nvidia Thrust library provides the same sorting flexibility as
postgres does.


// generate 32M random numbers on the host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;

// sort data on the device (846M keys per second on GeForce GTX 480)
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());


As you can see, the type to be ordered is a template parameter, and
thrust::sort also has a version that takes the comparator to use.

So compared with pg_qsort, thrust::sort gives you the same flexibility.

http://docs.thrust.googlecode.com/hg/group__sorting.html

Regards
Gaetano Mendola


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-11 Thread Gaetano Mendola

On 19/09/2011 16:36, Greg Smith wrote:

On 09/19/2011 10:12 AM, Greg Stark wrote:

With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.


The main problem here is that the sort of hardware commonly used for
production database servers doesn't have any serious enough GPU to
support CUDA/OpenCL available. The very clear trend now is that all
systems other than gaming ones ship with motherboard graphics chipsets
more than powerful enough for any task but that. I just checked the 5
most popular configurations of server I see my customers deploy
PostgreSQL onto (a mix of Dell and HP units), and you don't get a
serious GPU from any of them.

Intel's next generation Ivy Bridge chipset, expected for the spring of
2012, is going to add support for OpenCL to the built-in motherboard
GPU. We may eventually see that trickle into the server hardware side of
things too.



The trend is to have servers capable of running CUDA providing GPUs via
external hardware (PCI Express interface with PCI Express switches);
look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.


I did some experiments timing the sort done with CUDA and the sort done
with pg_qsort:

                        CUDA      pg_qsort
33 million integers:   ~ 900 ms   ~ 6000 ms
 1 million integers:   ~  21 ms   ~  162 ms
 100k integers:        ~   2 ms   ~   13 ms
The CUDA time already includes the copy operations (host->device, device->host).

As the GPU I was using a C2050, and the CPU doing the pg_qsort was an
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.


Copy operations and kernel runs (the sort for instance) can run in 
parallel, so while you are sorting a batch of data, you can copy the 
next batch in parallel.


As you can see the boost is not negligible.

The next Nvidia hardware (Kepler family) is PCI Express 3 ready, so expect
in the near future the bottleneck of the device-host-device copies to
have less impact.


I strongly believe there is space to provide modern database engines
with a way to offload sorts to the GPU.

 I've never seen a PostgreSQL server capable of running CUDA, and I
 don't expect that to change.

That sounds like:

I think there is a world market for maybe five computers.
- IBM Chairman Thomas Watson, 1943

Regards
Gaetano Mendola


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Optimize referential integrity checks (todo item)

2012-02-11 Thread Vik Reykja
I decided to take a crack at the todo item created from the following post:
http://archives.postgresql.org/pgsql-performance/2005-10/msg00458.php

The attached patch makes the desired changes in both code and function
naming.

It seemed quite easy to do but wasn't marked as easy on the todo, so I'm
wondering if I've missed something.  All regression tests pass.
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
new file mode 100644
index 03a974a..107408f
*** a/src/backend/utils/adt/ri_triggers.c
--- b/src/backend/utils/adt/ri_triggers.c
*** static void ri_BuildQueryKeyFull(RI_Quer
*** 205,215 
  static void ri_BuildQueryKeyPkCheck(RI_QueryKey *key,
  		const RI_ConstraintInfo *riinfo,
  		int32 constr_queryno);
! static bool ri_KeysEqual(Relation rel, HeapTuple oldtup, HeapTuple newtup,
  			 const RI_ConstraintInfo *riinfo, bool rel_is_pk);
! static bool ri_AllKeysUnequal(Relation rel, HeapTuple oldtup, HeapTuple newtup,
    const RI_ConstraintInfo *riinfo, bool rel_is_pk);
! static bool ri_OneKeyEqual(Relation rel, int column,
  			   HeapTuple oldtup, HeapTuple newtup,
  			   const RI_ConstraintInfo *riinfo, bool rel_is_pk);
  static bool ri_AttributesEqual(Oid eq_opr, Oid typeid,
--- 205,215 
  static void ri_BuildQueryKeyPkCheck(RI_QueryKey *key,
  		const RI_ConstraintInfo *riinfo,
  		int32 constr_queryno);
! static bool ri_KeysUnchanged(Relation rel, HeapTuple oldtup, HeapTuple newtup,
  			 const RI_ConstraintInfo *riinfo, bool rel_is_pk);
! static bool ri_AllKeysChanged(Relation rel, HeapTuple oldtup, HeapTuple newtup,
    const RI_ConstraintInfo *riinfo, bool rel_is_pk);
! static bool ri_OneKeyUnchanged(Relation rel, int column,
  			   HeapTuple oldtup, HeapTuple newtup,
  			   const RI_ConstraintInfo *riinfo, bool rel_is_pk);
  static bool ri_AttributesEqual(Oid eq_opr, Oid typeid,
*** RI_FKey_noaction_upd(PG_FUNCTION_ARGS)
*** 932,940 
  			}
  
  			/*
! 			 * No need to check anything if old and new keys are equal
  			 */
! 			if (ri_KeysEqual(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowShareLock);
  return PointerGetDatum(NULL);
--- 932,940 
  			}
  
  			/*
! 			 * No need to check anything if old and new keys are unchanged
  			 */
! 			if (ri_KeysUnchanged(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowShareLock);
  return PointerGetDatum(NULL);
*** RI_FKey_cascade_upd(PG_FUNCTION_ARGS)
*** 1281,1289 
  			}
  
  			/*
! 			 * No need to do anything if old and new keys are equal
  			 */
! 			if (ri_KeysEqual(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowExclusiveLock);
  return PointerGetDatum(NULL);
--- 1281,1289 
  			}
  
  			/*
! 			 * No need to do anything if old and new keys are unchanged
  			 */
! 			if (ri_KeysUnchanged(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowExclusiveLock);
  return PointerGetDatum(NULL);
*** RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
*** 1646,1654 
  			}
  
  			/*
! 			 * No need to check anything if old and new keys are equal
  			 */
! 			if (ri_KeysEqual(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowShareLock);
  return PointerGetDatum(NULL);
--- 1646,1654 
  			}
  
  			/*
! 			 * No need to check anything if old and new keys are unchanged
  			 */
! 			if (ri_KeysUnchanged(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowShareLock);
  return PointerGetDatum(NULL);
*** RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
*** 1993,2001 
  			}
  
  			/*
! 			 * No need to do anything if old and new keys are equal
  			 */
! 			if (ri_KeysEqual(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowExclusiveLock);
  return PointerGetDatum(NULL);
--- 1993,2001 
  			}
  
  			/*
! 			 * No need to do anything if old and new keys are unchanged
  			 */
! 			if (ri_KeysUnchanged(pk_rel, old_row, new_row, &riinfo, true))
  			{
  heap_close(fk_rel, RowExclusiveLock);
  return PointerGetDatum(NULL);
*** RI_FKey_setnull_upd(PG_FUNCTION_ARGS)
*** 2012,2024 
  			 * our cached plan, unless the update happens to change all
  			 * columns in the key.	Fortunately, for the most common case of a
  			 * single-column foreign key, this will be true.
- 			 *
- 			 * In case you're wondering, the inequality check works because we
- 			 * know that the old key value has no NULLs (see above).
  			 */
  
  			use_cached_query = (riinfo.confmatchtype == FKCONSTR_MATCH_FULL) ||
! 								ri_AllKeysUnequal(pk_rel, old_row, new_row,
  												  &riinfo, true);
  
  			/*
--- 2012,2021 
  			 * our cached plan, unless the update happens to change all
  			 * columns in the key.	Fortunately, for the most common case of a
  			 * single-column foreign key, this will be true.