Re: [HACKERS] New replication mode: write

2012-01-24 Thread Fujii Masao
On Wed, Jan 25, 2012 at 5:28 AM, Simon Riggs  wrote:
> On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs  wrote:
>
>> Yes, it might be too hard, but lets look.
>
> Your committer has timed out ;-)
>
> committed write mode only

Thanks for the commit!

The apply mode is attractive, but I need more time to implement it completely.
I might not be able to finish it within this CF, so committing the
write mode only is the right decision, I think. If I have time after
all of the patches I'm interested in have been committed, I will try
the apply mode again, though probably for 9.3dev.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Add new replication mode synchronous_commit = 'write'.

2012-01-24 Thread Fujii Masao
On Wed, Jan 25, 2012 at 5:35 AM, Jaime Casanova  wrote:
> On Tue, Jan 24, 2012 at 3:22 PM, Simon Riggs  wrote:
>> Add new replication mode synchronous_commit = 'write'.
>> Replication occurs only to memory on standby, not to disk,
>> so provides additional performance if user wishes to
>> reduce durability level slightly. Adds concept of multiple
>> independent sync rep queues.
>>
>> Fujii Masao and Simon Riggs
>>
>
> I guess you should add the new value to postgresql.conf too.

Yes, I forgot to do that. Patch attached.
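(The attached patch itself is not reproduced in the archive, so the exact wording below is an assumption; this is roughly the kind of commented sample entry the 9.2-era postgresql.conf carried for this setting:)

```
#synchronous_commit = on	# synchronization level; on, write, local, or off
```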

> just a question: why not add a flush value and make it behave like on?
> actually sync rep is on in both cases, and having different names
> makes it easier to understand what is happening.
> i only want to keep on for compatibility, but i wouldn't mind if we
> removed it in favor of more descriptive names.

Also, should we add "async" as an alias for "off"?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center





Re: [HACKERS] PgNext: CFP

2012-01-24 Thread Dave Page
On Tuesday, January 24, 2012, Stefan Kaltenbrunner wrote:
> On 01/24/2012 07:36 PM, Dave Page wrote:
>> What date & venue?
>>
>> On Tuesday, January 24, 2012, Joshua D. Drake wrote:
>>>
>>> Hello,
>>>
>>> The call for papers for PgNext (the old PgWest/PgEast) is now open:
>>>
>>> January 19th: Talk submission opens
>>> April 15th: Talk submission closes
>>> April 30th: Speaker notification
>>>
>>> Submit: https://www.postgresqlconference.org/talk_types
>
>
> the url has, in big letters, "Denver Convention Center" on June
> 26th-29th, 2012...

Not too visible on my phone...

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-24 Thread Pavan Deolasee
On Wed, Jan 25, 2012 at 2:23 AM, Robert Haas  wrote:
> Early yesterday morning, I was able to use Nate Boley's test machine
> do a single 30-minute pgbench run at scale factor 300 using a variety
> of trees built with various patches, and with the -l option added to
> track latency on a per-transaction basis.  All tests were done using
> 32 clients and permanent tables.  The configuration was otherwise
> identical to that described here:
>
> http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com
>
> By doing this, I hoped to get a better understanding of (1) the
> effects of a scale factor too large to fit in shared_buffers, (2) what
> happens on a longer test run, and (3) how response time varies
> throughout the test.  First, here are the raw tps numbers:
>
> background-clean-slru-v2: tps = 2027.282539 (including connections 
> establishing)
> buffreelistlock-reduction-v1: tps = 2625.155348 (including connections
> establishing)
> buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149
> (including connections establishing)
> freelist-ok-v2: tps = 2467.065010 (including connections establishing)
> group-commit-2012-01-21: tps = 2205.128609 (including connections 
> establishing)
> master: tps = 2200.848350 (including connections establishing)
> removebufmgrfreelist-v1: tps = 2679.453056 (including connections 
> establishing)
> xloginsert-scale-6: tps = 3675.312202 (including connections establishing)
>
> Obviously these numbers are fairly noisy, especially since this is
> just one run, so the increases and decreases might not be all that
> meaningful.  Time permitting, I'll try to run some more tests to get
> my hands around that situation a little better,
>

This is nice. I am sure long-running tests will point out many more
issues. If we are doing these tests, it might be more effective to
run even longer ones, so as to get at least 3-4
checkpoint/vacuum/analyze cycles (and other such events which can
impact the final numbers either way) per test. Otherwise, one patch
may stand out simply because it avoids, say, one checkpoint.

It would definitely help to log the checkpoint,
auto-vacuum/auto-analyze details and plot them on the graph to see if
the drop in performance has anything to do with these activities. It
might also be a good idea to collect pg statistics such as relation
sizes at the end of the run.

Thanks,
Pavan
-- 
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com



Re: [HACKERS] Group commit, revised

2012-01-24 Thread Simon Riggs
On Fri, Jan 20, 2012 at 10:30 PM, Heikki Linnakangas wrote:

> I spent some time cleaning this up. Details below, but here are the
> highlights:
>
> * Reverted the removal of wal_writer_delay
> * Removed heuristic on big flushes

No contested viewpoints on anything there.

> * Doesn't rely on WAL writer. Any process can act as the "leader" now.
> * Uses PGSemaphoreLock/Unlock instead of latches

Thanks for producing an alternate version; it will allow us to comment
on various approaches.

There is much yet to discuss so please don't think about committing
anything yet.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] pg_upgrade with plpython is broken

2012-01-24 Thread Bruce Momjian
On Fri, Jan 20, 2012 at 07:01:46AM +0200, Peter Eisentraut wrote:
> On tor, 2012-01-19 at 17:04 -0500, Bruce Momjian wrote:
> > For that reason, I wonder if I should just hard-code the plpython
> > rename into the pg_upgrade test in check_loadable_libraries().
> 
> Yes, I haven't come up with a better solution either.
> 
> If this becomes a general problem, we might need to add a command line
> option to ignore certain names or something.  But not yet.

Well, the problem is a little more complex than reported.  It turns out
in PG 9.0 we kept the plpython.so file and symlinked plpython2.so to it.
In PG 9.1, we removed plpython.so, and only have plpython2.so, so the
problem is with PG >= 9.1, and does not affect 9.0, which explains why
we didn't get any 9.0 reports of a problem.

I have applied the attached patch to PG head and 9.1 to fix the library
checking problem.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/function.c b/contrib/pg_upgrade/function.c
new file mode 100644
index 54f139a..4505e51
*** a/contrib/pg_upgrade/function.c
--- b/contrib/pg_upgrade/function.c
*** check_loadable_libraries(void)
*** 228,235 
  		char	   *cmd = (char *) pg_malloc(8 + 2 * llen + 1);
  		PGresult   *res;
  
  		strcpy(cmd, "LOAD '");
! 		PQescapeStringConn(conn, cmd + 6, lib, llen, NULL);
  		strcat(cmd, "'");
  
  		res = PQexec(conn, cmd);
--- 228,251 
  		char	   *cmd = (char *) pg_malloc(8 + 2 * llen + 1);
  		PGresult   *res;
  
+ 		/*
+ 		 *	In Postgres 9.0, Python 3 support was added, and to do that, a
+ 		 *	plpython2u language was created with library name plpython2.so
+ 		 *	as a symbolic link to plpython.so.  In Postgres 9.1, only the
+ 		 *	plpython2.so library was created, and both plpythonu and
+ 		 *	plpython2u pointing to it.  For this reason, any reference to
+ 		 *	library name "plpython" in an old PG <= 9.1 cluster must look
+ 		 *	for "plpython2" in the new cluster.
+ 		 */
+ 		if (GET_MAJOR_VERSION(old_cluster.major_version) < 901 &&
+ 			strcmp(lib, "$libdir/plpython") == 0)
+ 		{
+ 			lib = "$libdir/plpython2";
+ 			llen = strlen(lib);
+ 		}
+ 
  		strcpy(cmd, "LOAD '");
! 		PQescapeStringConn(conn, cmd + strlen(cmd), lib, llen, NULL);
  		strcat(cmd, "'");
  
  		res = PQexec(conn, cmd);



[HACKERS] Fix for pg_upgrade tablespace function usage

2012-01-24 Thread Bruce Momjian
I have applied the attached patch to git head to fix the new SQL tablespace
location function usage in pg_upgrade to properly check cluster version
numbers, and to fix a missing table alias.

I found this problem during testing.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/pg_upgrade/info.c b/contrib/pg_upgrade/info.c
new file mode 100644
index e8361ce..692cdc2
*** a/contrib/pg_upgrade/info.c
--- b/contrib/pg_upgrade/info.c
*** get_db_infos(ClusterInfo *cluster)
*** 204,210 
/* we don't preserve pg_database.oid so we sort by name */
"ORDER BY 2",
/* 9.2 removed the spclocation column */
!   (GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
"t.spclocation" : 
"pg_catalog.pg_tablespace_location(t.oid) AS spclocation");
  
res = executeQueryOrDie(conn, "%s", query);
--- 204,210 
/* we don't preserve pg_database.oid so we sort by name */
"ORDER BY 2",
/* 9.2 removed the spclocation column */
!   (GET_MAJOR_VERSION(cluster->major_version) <= 901) ?
"t.spclocation" : 
"pg_catalog.pg_tablespace_location(t.oid) AS spclocation");
  
res = executeQueryOrDie(conn, "%s", query);
*** get_rel_infos(ClusterInfo *cluster, DbIn
*** 287,293 
/* we preserve pg_class.oid so we sort by it to match old/new */
 "ORDER BY 1;",
/* 9.2 removed the spclocation column */
!(GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
 "t.spclocation" : 
"pg_catalog.pg_tablespace_location(t.oid) AS spclocation",
/* see the comment at the top of old_8_3_create_sequence_script() */
 (GET_MAJOR_VERSION(old_cluster.major_version) <= 803) ?
--- 287,293 
/* we preserve pg_class.oid so we sort by it to match old/new */
 "ORDER BY 1;",
/* 9.2 removed the spclocation column */
!(GET_MAJOR_VERSION(cluster->major_version) <= 901) ?
 "t.spclocation" : 
"pg_catalog.pg_tablespace_location(t.oid) AS spclocation",
/* see the comment at the top of old_8_3_create_sequence_script() */
 (GET_MAJOR_VERSION(old_cluster.major_version) <= 803) ?
diff --git a/contrib/pg_upgrade/tablespace.c b/contrib/pg_upgrade/tablespace.c
new file mode 100644
index 11fd9d0..6b61f4b
*** a/contrib/pg_upgrade/tablespace.c
--- b/contrib/pg_upgrade/tablespace.c
*** get_tablespace_paths(void)
*** 53,59 
 "  spcname != 'pg_global'",
/* 9.2 removed the spclocation column */
(GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
!   "t.spclocation" : 
"pg_catalog.pg_tablespace_location(oid) AS spclocation");
  
res = executeQueryOrDie(conn, "%s", query);
  
--- 53,59 
 "  spcname != 'pg_global'",
/* 9.2 removed the spclocation column */
(GET_MAJOR_VERSION(old_cluster.major_version) <= 901) ?
!   "spclocation" : "pg_catalog.pg_tablespace_location(oid) AS spclocation");
  
res = executeQueryOrDie(conn, "%s", query);
  



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 8:13 PM, Tom Lane  wrote:
>> Well, the bytea experience was IMNSHO a complete disaster (It was
>> earlier mentioned that jdbc clients were silently corrupting bytea
>> datums) and should be held up as an example of how *not* to do things;
>
> Yeah.  In both cases, the (proposed) new output format is
> self-identifying *to clients that know what to look for*.Unfortunately
> it would only be the most anally-written pre-existing client code that
> would be likely to spit up on the unexpected variations.  What's much
> more likely to happen, and did happen in the bytea case, is silent data
> corruption.  The lack of redundancy in binary data makes this even more
> likely, and the documentation situation makes it even worse.  If we had
> had a clear binary-data format spec from day one that told people that
> they must check for unexpected contents of the flag field and fail, then
> maybe we could get away with considering not doing so to be a
> client-side bug ... but I really don't think we have much of a leg to
> stand on given the poor documentation we've provided.
>
>> In regards to the array optimization, I think it's great -- but if you
>> truly want to avoid blowing up user applications, it needs to be
>> disabled automatically.
>
> Right.  We need to fix things so that this format will not be sent to
> clients unless the client code has indicated ability to accept it.
> A GUC is a really poor proxy for that.

OK.  It seems clear to me at this point that there is no appetite for
this patch in its present form:

https://commitfest.postgresql.org/action/patch_view?id=715

Furthermore, while we haven't settled the question of exactly what a
good negotiation facility would look like, we seem to agree that a GUC
isn't it.  I think that means this isn't going to happen for 9.2, so
we should mark this patch Returned with Feedback and return to this
topic for 9.3.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Tom Lane
Merlin Moncure  writes:
> On Tue, Jan 24, 2012 at 11:55 AM, Robert Haas  wrote:
>> I do wonder whether we are making a mountain out of a mole-hill here,
>> though.  If I properly understand the proposal on the table, which
>> it's possible that I don't, but if I do, the new format is
>> self-identifying: when the optimization is in use, it sets a bit that
>> previously would always have been clear.  So if we just go ahead and
>> change this, clients that have been updated to understand the new
>> format will work just fine.  The server uses the proposed optimization
>> only for arrays that meet certain criteria, so any properly updated
>> client must still be able to handle the case where that bit isn't set.
>>  On the flip side, clients that aren't expecting the new optimization
>> might break.  But that's, again, no different than what happened when
>> we changed the default bytea output format.  If you get bit, you
>> either update your client or shut off the optimization and deal with
>> the performance consequences of so doing.

> Well, the bytea experience was IMNSHO a complete disaster (It was
> earlier mentioned that jdbc clients were silently corrupting bytea
> datums) and should be held up as an example of how *not* to do things;

Yeah.  In both cases, the (proposed) new output format is
self-identifying *to clients that know what to look for*.  Unfortunately
it would only be the most anally-written pre-existing client code that
would be likely to spit up on the unexpected variations.  What's much
more likely to happen, and did happen in the bytea case, is silent data
corruption.  The lack of redundancy in binary data makes this even more
likely, and the documentation situation makes it even worse.  If we had
had a clear binary-data format spec from day one that told people that
they must check for unexpected contents of the flag field and fail, then
maybe we could get away with considering not doing so to be a
client-side bug ... but I really don't think we have much of a leg to
stand on given the poor documentation we've provided.

> In regards to the array optimization, I think it's great -- but if you
> truly want to avoid blowing up user applications, it needs to be
> disabled automatically.

Right.  We need to fix things so that this format will not be sent to
clients unless the client code has indicated ability to accept it.
A GUC is a really poor proxy for that.

regards, tom lane



Re: [HACKERS] basic pgbench runs with various performance-related patches

2012-01-24 Thread Tatsuo Ishii
> My test was run with synchronous_commit=off, so I didn't expect the
> group commit patch to have much of an impact.  I included it mostly to
> see whether by chance it helped anyway (since it also helps other WAL
> flushes, not just commits) or whether it caused any regression.

Oh, I see.

> One somewhat odd thing about these numbers is that, on permanent
> tables, all of the patches seemed to show regressions vs. master in
> single-client throughput.  That's a slightly difficult result to
> believe, though, so it's probably a testing artifact of some kind.

Maybe kernel cache effect?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



[HACKERS] Different error messages executing CREATE TABLE or ALTER TABLE to create a column "xmin"

2012-01-24 Thread Vik Reykja
I took my first stab at hacking the sources to fix the bug reported here:

http://archives.postgresql.org/pgsql-bugs/2012-01/msg00152.php

It's such a simple patch but it took me several hours with Google and IRC
and I'm sure I did many things wrong.  It seems to work as advertised,
though, so I'm submitting it.

I suppose since I have now submitted a patch, it is my moral obligation to
review at least one.  I'm not sure how helpful I'll be, but I'll go bite
the bullet and sign myself up anyway.
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
new file mode 100644
index cb8ac67..7555d19
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** ATExecAddColumn(List **wqueue, AlteredTa
*** 4235,4240 
--- 4235,4241 
  	List	   *children;
  	ListCell   *child;
  	AclResult	aclresult;
+ 	HeapTuple	attTuple;
  
  	/* At top level, permission check was done in ATPrepCmd, else do it */
  	if (recursing)
*** ATExecAddColumn(List **wqueue, AlteredTa
*** 4314,4326 
  	 * this test is deliberately not attisdropped-aware, since if one tries to
  	 * add a column matching a dropped column name, it's gonna fail anyway.
  	 */
! 	if (SearchSysCacheExists2(ATTNAME,
! 			  ObjectIdGetDatum(myrelid),
! 			  PointerGetDatum(colDef->colname)))
! 		ereport(ERROR,
! (errcode(ERRCODE_DUPLICATE_COLUMN),
!  errmsg("column \"%s\" of relation \"%s\" already exists",
! 		colDef->colname, RelationGetRelationName(rel))));
  
  	/* Determine the new attribute's number */
  	if (isOid)
--- 4315,4341 
  	 * this test is deliberately not attisdropped-aware, since if one tries to
  	 * add a column matching a dropped column name, it's gonna fail anyway.
  	 */
! 	attTuple = SearchSysCache2(ATTNAME,
! 			   ObjectIdGetDatum(myrelid),
! 			   PointerGetDatum(colDef->colname));
! 
! 	if (HeapTupleIsValid(attTuple))
! 	{
! 	int attnum;
! 	attnum = ((Form_pg_attribute) GETSTRUCT(attTuple))->attnum;
! 	if (attnum <= 0)
! 		ereport(ERROR,
! (errcode(ERRCODE_DUPLICATE_COLUMN),
!  errmsg("column \"%s\" conflicts with a system column name",
! 		colDef->colname)));
! 	else
! 		ereport(ERROR,
! (errcode(ERRCODE_DUPLICATE_COLUMN),
!  errmsg("column \"%s\" of relation \"%s\" already exists",
! 		colDef->colname, RelationGetRelationName(rel))));
! 
! 		ReleaseSysCache(attTuple);
! }
  
  	/* Determine the new attribute's number */
  	if (isOid)
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
new file mode 100644
index e992549..8385afb
*** a/src/test/regress/expected/alter_table.out
--- b/src/test/regress/expected/alter_table.out
*** COMMENT ON TABLE tmp_wrong IS 'table com
*** 7,12 
--- 7,14 
  ERROR:  relation "tmp_wrong" does not exist
  COMMENT ON TABLE tmp IS 'table comment';
  COMMENT ON TABLE tmp IS NULL;
+ ALTER TABLE tmp ADD COLUMN xmin integer; -- fails
+ ERROR:  column "xmin" conflicts with a system column name
  ALTER TABLE tmp ADD COLUMN a int4 default 3;
  ALTER TABLE tmp ADD COLUMN b name;
  ALTER TABLE tmp ADD COLUMN c text;
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
new file mode 100644
index d9bf08d..d4e4c49
*** a/src/test/regress/sql/alter_table.sql
--- b/src/test/regress/sql/alter_table.sql
*** COMMENT ON TABLE tmp_wrong IS 'table com
*** 9,14 
--- 9,16 
  COMMENT ON TABLE tmp IS 'table comment';
  COMMENT ON TABLE tmp IS NULL;
  
+ ALTER TABLE tmp ADD COLUMN xmin integer; -- fails
+ 
  ALTER TABLE tmp ADD COLUMN a int4 default 3;
  
  ALTER TABLE tmp ADD COLUMN b name;



Re: [HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 8:53 PM, Robert Haas  wrote:
>
> do a single 30-minute pgbench run at scale factor 300 using a variety

Nice

A minor but necessary point: Repeated testing of the Group commit
patch when you have synch commit off is clearly pointless, so
publishing numbers for that without saying very clearly that's what's
happening doesn't really help.

> Graphs are here:
>
> http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

Nicer

> There are two graphs for each branch.  The first is a scatter plot of
> latency vs. transaction time.  I found that graph hard to understand,
> though; I couldn't really tell what I was looking at.  So I made a
> second set of graphs which graph number of completed transactions in a
> given second of the test against time.  The results are also included
> on the previous page, below the latency graphs, and I find them much
> more informative.
>
> A couple of things stand out at me from these graphs.  First, some of
> these transactions had really long latency.  Second, there are a
> remarkable number of seconds throughout the test during which no
> transactions at all manage to complete, sometimes several seconds in a
> row.  I'm not sure why.  Third, all of the tests initially start off
> processing transactions very quickly, and get slammed down very hard,
> probably because the very high rate of transaction processing early on
> causes a checkpoint to occur around 200 s.

Check

I'm happy that this exposes characteristics I've seen and have been
discussing for a while now.

It would be useful to calculate the slow-down contribution of the
longer txns: what % of total time is given over to the slowest 10% of
transactions? Looking at that, I think you'll see why we should care
about sorting out what happens in the worst cases.

>  I didn't actually log when
> the checkpoints were occurring, but it seems like a good guess.  It's
> also interesting to wonder whether the checkpoint I/O itself causes
> the drop-off, or the ensuing full page writes.  Fourth,
> xloginsert-scale-6 helps quite a bit; in fact, it's the only patch
> that actually changes the whole shape of the tps graph.  I'm
> speculating here, but that may be because it blunts the impact of full
> page writes by allowing backends to copy their full page images into
> the write-ahead log in parallel.

I think we should be working to commit XLogInsert and then Group
Commit, then come back to the discussion.

There's no way we're committing other patches but not those, so
everything needs to be viewed with those two in, i.e. commit and then
re-baseline.

So I'd say no more tests just yet, but then lots of testing next week++.

> But I plan to drop
> buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs
> based on Simon's comments elsewhere.  I'm including the results here
> just because these tests were already running when he made those
> comments.

Yep

One you aren't testing is clog_history, which is designed to work in
conjunction with background_clean_slru. That last one clearly needs a
little tuning though...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Merlin Moncure
On Tue, Jan 24, 2012 at 11:55 AM, Robert Haas  wrote:
> I do wonder whether we are making a mountain out of a mole-hill here,
> though.  If I properly understand the proposal on the table, which
> it's possible that I don't, but if I do, the new format is
> self-identifying: when the optimization is in use, it sets a bit that
> previously would always have been clear.  So if we just go ahead and
> change this, clients that have been updated to understand the new
> format will work just fine.  The server uses the proposed optimization
> only for arrays that meet certain criteria, so any properly updated
> client must still be able to handle the case where that bit isn't set.
>  On the flip side, clients that aren't expecting the new optimization
> might break.  But that's, again, no different than what happened when
> we changed the default bytea output format.  If you get bit, you
> either update your client or shut off the optimization and deal with
> the performance consequences of so doing.

Well, the bytea experience was IMNSHO a complete disaster (It was
earlier mentioned that jdbc clients were silently corrupting bytea
datums) and should be held up as an example of how *not* to do things;
it's better to avoid having to depend on the GUC or defensive
programmatic intervention to prevent further occurrences of
application failure since the former doesn't work and the latter won't
be reliably done.  Waiting for applications to break in the field only
to point affected users at the GUC is weak sauce.  It's creating a
user culture that is terrified of database upgrades which hurts
everybody.

Database apps tend to have long lives in computer terms such that they
can greatly outlive the service life of a particular postgres dot
release or even the programmers who originally wrote the application.
I'm not too concerned about the viability of a programming department
with Robert Haas at the helm, but what about when he leaves?  What
about critical 3rd party software that is no longer maintained?

In regards to the array optimization, I think it's great -- but if you
truly want to avoid blowing up user applications, it needs to be
disabled automatically.

merlin



[HACKERS] some longer, larger pgbench tests with various performance-related patches

2012-01-24 Thread Robert Haas
Early yesterday morning, I was able to use Nate Boley's test machine
do a single 30-minute pgbench run at scale factor 300 using a variety
of trees built with various patches, and with the -l option added to
track latency on a per-transaction basis.  All tests were done using
32 clients and permanent tables.  The configuration was otherwise
identical to that described here:

http://archives.postgresql.org/message-id/CA+TgmoboYJurJEOB22Wp9RECMSEYGNyHDVFv5yisvERqFw=6...@mail.gmail.com

By doing this, I hoped to get a better understanding of (1) the
effects of a scale factor too large to fit in shared_buffers, (2) what
happens on a longer test run, and (3) how response time varies
throughout the test.  First, here are the raw tps numbers:

background-clean-slru-v2: tps = 2027.282539 (including connections establishing)
buffreelistlock-reduction-v1: tps = 2625.155348 (including connections
establishing)
buffreelistlock-reduction-v1-freelist-ok-v2: tps = 2468.638149
(including connections establishing)
freelist-ok-v2: tps = 2467.065010 (including connections establishing)
group-commit-2012-01-21: tps = 2205.128609 (including connections establishing)
master: tps = 2200.848350 (including connections establishing)
removebufmgrfreelist-v1: tps = 2679.453056 (including connections establishing)
xloginsert-scale-6: tps = 3675.312202 (including connections establishing)

Obviously these numbers are fairly noisy, especially since this is
just one run, so the increases and decreases might not be all that
meaningful.  Time permitting, I'll try to run some more tests to get
my hands around that situation a little better,

Graphs are here:

http://wiki.postgresql.org/wiki/Robert_Haas_9.2CF4_Benchmark_Results

There are two graphs for each branch.  The first is a scatter plot of
latency vs. transaction time.  I found that graph hard to understand,
though; I couldn't really tell what I was looking at.  So I made a
second set of graphs which graph number of completed transactions in a
given second of the test against time.  The results are also included
on the previous page, below the latency graphs, and I find them much
more informative.

A couple of things stand out at me from these graphs.  First, some of
these transactions had really long latency.  Second, there are a
remarkable number of seconds throughout the test during which no
transactions at all manage to complete, sometimes several seconds in a
row.  I'm not sure why.  Third, all of the tests initially start off
processing transactions very quickly, and get slammed down very hard,
probably because the very high rate of transaction processing early on
causes a checkpoint to occur around 200 s.  I didn't actually log when
the checkpoints were occurring, but it seems like a good guess.  It's
also interesting to wonder whether the checkpoint I/O itself causes
the drop-off, or the ensuing full page writes.  Fourth,
xloginsert-scale-6 helps quite a bit; in fact, it's the only patch
that actually changes the whole shape of the tps graph.  I'm
speculating here, but that may be because it blunts the impact of full
page writes by allowing backends to copy their full page images into
the write-ahead log in parallel.

One thing I also noticed while running the tests is that the system
was really not using much CPU time.  It was mostly idle.  That could
be because waiting for I/O leads to waiting for locks, or it could be
fundamental lock contention.  I don't know which.

A couple of obvious further tests suggest themselves: (1) rerun some
of the tests with full_page_writes=off, and (2) repeat this test with
the remaining performance-related patches.  It would be especially
interesting, I think, to see what effect the checkpoint-related
patches have on these graphs.  But I plan to drop
buffreelistlock-reduction-v1 and freelist-ok-v2 from future test runs
based on Simon's comments elsewhere.  I'm including the results here
just because these tests were already running when he made those
comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] New replication mode: write

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs  wrote:

> Yes, it might be too hard, but lets look.

Your committer has timed out ;-)

committed write mode only

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] PgNext: CFP

2012-01-24 Thread Stefan Kaltenbrunner
On 01/24/2012 07:36 PM, Dave Page wrote:
> What date & venue?
> 
> On Tuesday, January 24, 2012, Joshua D. Drake  wrote:
>>
>> Hello,
>>
>> The call for papers for PgNext (the old PgWest/PgEast) is now open:
>>
>> January 19th: Talk submission opens
>> April 15th: Talk submission closes
>> April 30th: Speaker notification
>>
>> Submit: https://www.postgresqlconference.org/talk_types


the URL has "Denver Convention Center" in big letters, on June
26th-29th, 2012...


Stefan



Re: [HACKERS] controlling the location of server-side SSL files

2012-01-24 Thread Peter Eisentraut
On tor, 2012-01-19 at 15:44 -0500, Robert Haas wrote:
> On Sat, Jan 14, 2012 at 8:40 AM, Peter Eisentraut  wrote:
> > On mån, 2012-01-02 at 06:32 +0200, Peter Eisentraut wrote:
> >> I think I would like to have a set of GUC parameters to control the
> >> location of the server-side SSL files.
> >
> > Here is the patch for this.
> >
> > One thing that is perhaps worth thinking about:  Currently, we just
> > ignore missing root.crt and root.crl files.  With this patch, we still
> > do this, even if the user has given a specific nondefault location.
> > That seems a bit odd, but I can't think of a simple way to do it better.
> 
> There's a review in the CF app for this finding only minor issues, so
> I'm marking this patch therein as "Ready for Committer".

OK, no one had any concerns about the missing file behavior I described
above?  If not, then I'll commit it soon.
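For reference, the GUCs in question would let postgresql.conf spell out the file locations explicitly, along these lines (parameter names and values shown are the proposed spelling; check the committed patch for the final form):

```
# server-side SSL file locations (relative paths are taken
# relative to the data directory)
ssl_cert_file = 'server.crt'
ssl_key_file  = 'server.key'
ssl_ca_file   = 'root.crt'     # missing file currently ignored
ssl_crl_file  = 'root.crl'     # likewise
```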





Re: [HACKERS] PgNext: CFP

2012-01-24 Thread Dave Page
What date & venue?

On Tuesday, January 24, 2012, Joshua D. Drake  wrote:
>
> Hello,
>
> The call for papers for PgNext (the old PgWest/PgEast) is now open:
>
> January 19th: Talk submission opens
> April 15th: Talk submission closes
> April 30th: Speaker notification
>
> Submit: https://www.postgresqlconference.org/talk_types
>
> Sincerely,
>
> JD
>
> --
> Command Prompt, Inc. - http://www.commandprompt.com/
> PostgreSQL Support, Training, Professional Services and Development
> The PostgreSQL Conference - http://www.postgresqlconference.org/
> @cmdpromptinc - @postgresconf - 509-416-6579
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
Robert Haas  writes:
> Yes, that's what I meant when I suggested it originally.  I'm just not
> sure it's any nicer than adding ifdefs for USE_ASSERT_CHECKING.

I'm inclined to think that it probably is nicer, just because of less
vertical space used.  But again, this opinion is contingent on what it
will look like after pgindent gets done with it ...

regards, tom lane



Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-01-24 Thread Robert Haas
On Sun, Jan 22, 2012 at 11:04 AM, Julien Tachoires  wrote:
> 2011/12/15 Alvaro Herrera :
>>
>> Uhm, surely you could compare the original toast tablespace to the heap
>> tablespace, and if they differ, handle appropriately when creating the
>> new toast table?  Just pass down the toast tablespace into
>> AlterTableCreateToastTable, instead of having it assume that
>> rel->rd_rel->relnamespace is sufficient.  This should be done in all
>> cases where a toast tablespace is created, which shouldn't be more than
>> a handful of them.
>
> Thank you, that way seems right.
> Now, I distinguish before each creation of a TOAST table with
> AlterTableCreateToastTable() : if it will create a new one or recreate
> an existing one.
> Thus, in create_toast_table() when toastTableSpace is equal to
> InvalidOid, we are able :
> - to fallback to the main table tablespace in case of new TOAST table creation
> - to keep it previous tablespace in case of recreation.
>
> Here's a new version rebased against HEAD.

To ask more directly the question that's come up a few times upthread,
why do *you* think this is useful?  What motivated you to want this
behavior, and/or how do you think it could benefit other PostgreSQL
users?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 1:03 PM, Tom Lane  wrote:
> I wrote:
>> Also, it occurs to me that an intermediate macro
>> "PG_USED_FOR_ASSERTS_ONLY" would be a good idea, first because it
>> documents *why* you want to mark the variable as possibly unused,
>> and second because changing the macro definition would provide an easy way
>> to check for totally-unused variables, in case we wanted to periodically
>> make such checks.
>
> Uh, wait a second.  Why not
>
> #ifdef USE_ASSERT_CHECKING
> #define PG_USED_FOR_ASSERTS_ONLY
> #else
> #define PG_USED_FOR_ASSERTS_ONLY __attribute__((unused))
> #endif
>
> Then, when you build with asserts on, you *automatically* get told
> if the variable is entirely unused.

Yes, that's what I meant when I suggested it originally.  I'm just not
sure it's any nicer than adding ifdefs for USE_ASSERT_CHECKING.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
I wrote:
> Also, it occurs to me that an intermediate macro
> "PG_USED_FOR_ASSERTS_ONLY" would be a good idea, first because it
> documents *why* you want to mark the variable as possibly unused,
> and second because changing the macro definition would provide an easy way
> to check for totally-unused variables, in case we wanted to periodically
> make such checks.

Uh, wait a second.  Why not

#ifdef USE_ASSERT_CHECKING
#define PG_USED_FOR_ASSERTS_ONLY
#else
#define PG_USED_FOR_ASSERTS_ONLY __attribute__((unused))
#endif

Then, when you build with asserts on, you *automatically* get told
if the variable is entirely unused.
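Usage would then look roughly like this (a toy sketch: NDEBUG stands in for a non-assert build, the attribute is GCC-specific, and the function and variable names are made up):

```c
#include <assert.h>

/* sketch of the proposed macro; !defined(NDEBUG) plays the role of
 * USE_ASSERT_CHECKING here */
#ifndef NDEBUG
#define PG_USED_FOR_ASSERTS_ONLY
#else
#define PG_USED_FOR_ASSERTS_ONLY __attribute__((unused))
#endif

static int
pop_item(int *stack, int *top)
{
    int depth PG_USED_FOR_ASSERTS_ONLY = *top;  /* read only by the assert */
    int item = stack[--(*top)];

    assert(depth > 0);      /* in assert builds, depth is genuinely used; */
    return item;            /* in NDEBUG builds the attribute silences it */
}
```

In an assert-enabled build the attribute expands to nothing, so a variable that is never used anywhere at all still draws a warning, which is the self-checking property described above.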

regards, tom lane



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 24, 2012 at 12:12 PM, Tom Lane  wrote:
>> Ouch. Is it really true that __attribute__((unused)) disables detection
>> of use of uninitialized variables?

> Oh, I think maybe I am confused.  The downsides of disabling *unused*
> variable detection are obviously much less than the downsides of
> disabling *uninitialized* variable detection... although neither is
> ideal.

OK.  MHO is that __attribute__((unused)) is probably less annoying than
#ifdef overall.  Also, it occurs to me that an intermediate macro
"PG_USED_FOR_ASSERTS_ONLY" would be a good idea, first because it
documents *why* you want to mark the variable as possibly unused,
and second because changing the macro definition would provide an easy way
to check for totally-unused variables, in case we wanted to periodically
make such checks.

This is all modulo the question of what pgindent will do with it,
which I would still like to see tested before we commit to a method.

regards, tom lane



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 11:16 AM, Merlin Moncure  wrote:
>> Our current protocol allocates a 2-byte integer for the purposes of
>> specifying the type of each parameter, and another 2-byte integer for
>> the purpose of specifying the result type... but only one bit is
>> really needed at present: text or binary.  If we revise the protocol
>> version at some point, we might want to use some of that bit space to
>> allow some more fine-grained negotiation of the protocol version.  So,
>> for example, we might define the top 5 bits as reserved (always pass
>> zero), the next bit as a text/binary flag, and the remaining 10 bits
>> as a 10-bit "format version number".  When a change like this comes
>> along, we can bump the highest binary format version recognized by the
>> server, and clients who request the new version can get it.
>>
>> Alternatively, we might conclude that a 2-byte integer for each
>> parameter is overkill and try to cut back... but the point is there's
>> a bunch of unused bitspace there now.  In theory we could even do
>> something like this without bumping the protocol version since the
>> documentation seems clear that any value other than 0 and 1 yields
>> undefined behavior, but in practice that seems like it might be a bit
>> too edgy.
>
> Yeah.  But again, this isn't a contract between libpq and the server,
> but between the application and the server...

I don't see how this is relevant.  The text/binary format flag is
there in both libpq and the underlying protocol.

>  So I'd vote against any format code
> beyond the text/binary switch that currently exists (which, by the
> way, while useful, is one of the great sins of libpq that we have to
> deal with basically forever).  While wire formatting is granular down
> to the type level, applications should not have to deal with that.
> They should Just Work.  So who decides what format code to stuff into
> the protocol?  Where are the codes defined?
>
> I'm very much in the camp that sometime, presumably during connection
> startup, the protocol accepts a non-#defined-in-libpq token (database
> version?) from the application that describes to the server what wire
> formats can be used and the server sends one back.  There probably has
> to be some additional facilities for non-core types but let's put that
> aside for the moment.  Those two tokens allow the server to pick the
> highest supported wire format (text and binary!) that everybody
> understands.  The server's token is useful if we're being fancy and we
> want libpq to translate an older server's wire format to a newer one
> for the application.  This of course means moving some of the type
> system into the client, which is something we might not want to do
> since among other things it puts a heavy burden on non-libpq driver
> authors (but then again, they can always stay on the v3 protocol,
> which can benefit from being frozen in terms of wire formats).

I think it's sensible for the server to advertise a version to the
client, but I don't see how you can dismiss add-on types so blithely.
The format used to represent any given type is logically a property of
that type, and only for built-in types is that associated with the
server version.

I do wonder whether we are making a mountain out of a mole-hill here,
though.  If I properly understand the proposal on the table, which
it's possible that I don't, but if I do, the new format is
self-identifying: when the optimization is in use, it sets a bit that
previously would always have been clear.  So if we just go ahead and
change this, clients that have been updated to understand the new
format will work just fine.  The server uses the proposed optimization
only for arrays that meet certain criteria, so any properly updated
client must still be able to handle the case where that bit isn't set.
On the flip side, clients that aren't expecting the new optimization
might break.  But that's, again, no different than what happened when
we changed the default bytea output format.  If you get bit, you
either update your client or shut off the optimization and deal with
the performance consequences of so doing.  In fact, the cases are
almost perfectly analogous, because in each case the proposal was
based on the size of the output format being larger than necessary,
and wanting to squeeze it down to a smaller size for compactness.

And more generally, does anyone really expect that we're never going
to change the output format of any type we support ever again, without
retaining infinite backward compatibility?  I didn't hear any screams
of outrage when we updated the hyphenation rules for contrib/isbn -
well, ok, there were some howls, but that was because the rules were
still incomplete and US-centric, not so much because people thought it
was unacceptable for the hyphenation rules to be different in major
release N+1 than they were in major release N.  If the IETF goes and
defines a new standard for formatting IPv6 addresses, we're lik

[HACKERS] PgNext: CFP

2012-01-24 Thread Joshua D. Drake


Hello,

The call for papers for PgNext (the old PgWest/PgEast) is now open:

January 19th: Talk submission opens
April 15th: Talk submission closes
April 30th: Speaker notification

Submit: https://www.postgresqlconference.org/talk_types

Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579



Re: [HACKERS] Page Checksums

2012-01-24 Thread Jim Nasby
On Jan 24, 2012, at 9:15 AM, Simon Riggs wrote:
> On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat  wrote:
>>> And yes, I would for sure turn such functionality on if it were present.
>>> 
>> 
>> That's nice to say, but most people aren't willing to take a 50%
>> performance hit. Not saying what we end up with will be that bad, but
>> I've seen people get upset about performance hits much lower than
>> that.
> When we talk about a 50% hit, are we discussing (1) checksums that are
> checked on each I/O, or (2) checksums that are checked each time we
> re-pin a shared buffer?  The 50% hit was my estimate of (2) and has
> not yet been measured, so shouldn't be used unqualified when
> discussing checksums. Same thing is also true "I would use it"
> comments, since we're not sure whether you're voting for (1) or (2).
> 
> As to whether people will actually use (1), I have no clue. But I do
> know is that many people request that feature, including people that
> run heavy duty Postgres production systems and who also know about
> filesystems. Do people need (2)? It's easy enough to add as an option,
> once we have (1) and there is real interest.

Some people will be able to take a 50% hit and will happily turn on 
checksumming every time a page is pinned. But I suspect a lot of folks can't 
afford that kind of hit, but would really like to have their filesystem cache 
protected (we're certainly in the later camp).

As for checksumming filesystems, I didn't see any answers about whether the 
filesystem *cache* was also protected by the filesystem checksum. Even if it 
is, the choice of checksumming filesystems is certainly limited... ZFS is the 
only one that seems to have real traction, but that forces you off of Linux, 
which is a problem  for many shops.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 12:12 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Spraying the code with __attribute__((unused)) is somewhat undesirable
>> because it could mask a failure to properly initialize the variable in
>> an assert-enabled build.
>
> Ouch.  Is it really true that __attribute__((unused)) disables detection
> of use of uninitialized variables?  That would be nasty, and it's not
> obvious to me why it should need to work like that.  But if it is true,
> then I agree that that makes this approach not too tenable.

Oh, I think maybe I am confused.  The downsides of disabling *unused*
variable detection are obviously much less than the downsides of
disabling *uninitialized* variable detection... although neither is
ideal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Tom Lane
Robert Haas  writes:
> Spraying the code with __attribute__((unused)) is somewhat undesirable
> because it could mask a failure to properly initialize the variable in
> an assert-enabled build.

Ouch.  Is it really true that __attribute__((unused)) disables detection
of use of uninitialized variables?  That would be nasty, and it's not
obvious to me why it should need to work like that.  But if it is true,
then I agree that that makes this approach not too tenable.

regards, tom lane



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-24 Thread Robert Haas
On Sat, Jan 21, 2012 at 1:06 PM, Peter Eisentraut  wrote:
> So, here are three patches that could solve this issue.
>
> pg-cassert-unused-attribute.patch, the one I already showed, using
> __attribute__((unused).
>
> pg-cassert-unused-ifdef.patch, using only additional #ifdef
> USE_ASSERT_CHECKING.  This does have additional documentation value, but
> you can see that it gets bulky and complicated.  (I introduced several
> bugs merely while creating this patch.)
>
> pg-cassert-unused-void.patch is an alternative approach that avoids the
> warnings by casting the arguments of Assert() to void if assertions are
> disabled.  This has the least code impact, but it changes the
> traditional semantics of asserts, which is that they disappear
> completely when not enabled.  You can see how this creates a problem in
> src/backend/replication/syncrep.c, where the nontrivial call to
> SyncRepQueueIsOrderedByLSN() now becomes visible even in non-assert
> builds.  I played around with some other options like
> __attribute__((pure)) to make the compiler optimize the function away in
> that case, but that didn't appear to work.  So this might not be a
> workable solution (and it would be GCC-specific anyway).

I think the third approach is unacceptable from a performance point of view.

Some of these problems can be fixed without resorting to as much
hackery as you have here.  For example, in nodeWorkTableScan.c, you
could easily fix the problem by getting rid of the estate variable and
replacing its single use within the assertion with the expression
used to initialize it on the previous line.  I think this might be the
cleanest solution in general.

I'm not sure what to do about the cases where that isn't practical.
Spraying the code with __attribute__((unused)) is somewhat undesirable
because it could mask a failure to properly initialize the variable in
an assert-enabled build.  We could have a macro
PG_UNUSED_IF_NO_ASSERTS or something, but that doesn't have any
obvious advantage over just getting rid of the variable altogether in
such cases.  I lean toward the view that variables not needed in
assertion-free builds should be #ifdef'd out altogether, as in your
second patch, but we should try to minimize the number of places where
we need to do that.
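The performance objection to the third patch can be seen in a toy form (the Assert definitions below are my guesses at what the void-cast patch does, and the check function is a made-up stand-in):

```c
#include <assert.h>

static int check_calls = 0;

/* stand-in for an expensive validation such as
 * SyncRepQueueIsOrderedByLSN() */
static int
queue_is_ordered(void)
{
    check_calls++;
    return 1;
}

/* pg-cassert-unused-void.patch presumably does something like this in
 * non-assert builds: the argument stays visible to the compiler, so a
 * nontrivial call is still evaluated at runtime */
#define Assert_void(cond)   ((void) (cond))

/* whereas the traditional non-assert Assert vanishes entirely and its
 * argument is never evaluated */
#define Assert_empty(cond)  ((void) 0)

/* returns how many times the check ran under the two definitions */
static int
run_demo(void)
{
    check_calls = 0;
    Assert_empty(queue_is_ordered());   /* not evaluated */
    Assert_void(queue_is_ordered());    /* still evaluated */
    return check_calls;                 /* 1: only the void form ran it */
}
```

That residual evaluation in non-assert builds is why the void-cast approach changes the traditional semantics of Assert.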

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multithread Query Planner

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 11:25 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> I doubt it.  Almost nothing in the backend is thread-safe.  You can't
>> acquire a heavyweight lock, a lightweight lock, or a spinlock. You
>> can't do anything that might elog() or ereport().  None of those
>> things are reentrant.
>
> Not to mention palloc, another extremely fundamental and non-reentrant
> subsystem.
>
> Possibly we could work on making all that stuff re-entrant, but it would
> be a huge amount of work for a distant and uncertain payoff.

Right.  I think it makes more sense to try to get parallelism working
first with the infrastructure we have.  Converting to use threading,
if we ever do it at all, should be something we view as a later
performance optimization.  But I suspect we won't want to do it
anyway; I think there will be easier ways to get where we want to be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Multithread Query Planner

2012-01-24 Thread Tom Lane
Robert Haas  writes:
> I doubt it.  Almost nothing in the backend is thread-safe.  You can't
> acquire a heavyweight lock, a lightweight lock, or a spinlock. You
> can't do anything that might elog() or ereport().  None of those
> things are reentrant.

Not to mention palloc, another extremely fundamental and non-reentrant
subsystem.

Possibly we could work on making all that stuff re-entrant, but it would
be a huge amount of work for a distant and uncertain payoff.

regards, tom lane



Re: [HACKERS] Measuring relation free space

2012-01-24 Thread Jaime Casanova
On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch  wrote:
> On Mon, Jan 23, 2012 at 04:56:24PM -0300, Alvaro Herrera wrote:
>>
>> Hm.  Leaf pages hold as much tuples as non-leaf pages, no?  I mean
>> for each page element there's a value and a CTID.  In non-leaf those
>> CTIDs point to other index pages, one level down the tree; in leaf pages
>> they point to the heap.
>
> That distinction seemed important when I sent my last message, but now I agree
> that it's largely irrelevant for free space purposes.  If someone feels like
> doing it, +1 for making pgstattuple() count non-leaf free space.
>

actually i agreed that non-leaf pages are irrelevant... i just
confirmed that in a production system with 300GB none of the indexes
in an 84M rows table nor in a heavily updated one has more than 1 root
page, all the rest are deleted, half_dead or leaf. so the possibility
of bloat coming from non-leaf pages seems very odd

but the possibility of bloat coming from the meta page doesn't exist,
AFAIUI at least

we need the most accurate value about usable free space, because the
idea is to add a sampler mode to the function so we don't scan the
whole relation. that's why we still need the function.

btw... pgstattuple also has the problem that it's not using a ring buffer


attached are two patches:
- v5: is the same original patch but only track space in leaf, deleted
and half_dead pages
- v5.1: adds the same for all kind of indexes (problem is that this is
inconsistent with the fact that pageinspect only manages btree indexes
for everything else)
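The percentage arithmetic in the patch can be illustrated standalone (same accounting rules as GetBTRelationFreeSpace, with made-up page stats):

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192

typedef struct
{
    char    type;       /* 'l' leaf, 'd' deleted, 'e' half-dead, ... */
    size_t  page_size;
    size_t  free_size;
} PageStat;

/*
 * Mirror of the patch's accounting: deleted and half-dead pages count
 * as entirely free, leaf pages contribute their free space, and all
 * other page types are ignored.
 */
static double
bt_free_fraction(const PageStat *stats, size_t n)
{
    size_t  free_space = 0;
    size_t  counted = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (stats[i].type == 'd' || stats[i].type == 'e')
        {
            free_space += stats[i].page_size;
            counted++;
        }
        else if (stats[i].type == 'l')
        {
            free_space += stats[i].free_size;
            counted++;
        }
    }
    return counted ? (double) free_space / (counted * BLCKSZ) : 0.0;
}
```

For example, two half-full leaf pages plus one deleted page come out at two thirds free; a root or internal page in the same relation does not affect the result.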

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
new file mode 100644
index 13ba6d3..63fab95
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
*** MODULE_big	= pageinspect
*** 4,10 
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
--- 4,12 
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--1.1.sql \
!pageinspect--1.0--1.1.sql \
!pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
new file mode 100644
index dbb2158..9c0b0fb
*** a/contrib/pageinspect/btreefuncs.c
--- b/contrib/pageinspect/btreefuncs.c
***
*** 34,39 
--- 34,40 
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
  
  extern Datum bt_metap(PG_FUNCTION_ARGS);
  extern Datum bt_page_items(PG_FUNCTION_ARGS);
*** GetBTPageStatistics(BlockNumber blkno, B
*** 155,160 
--- 156,216 
  		stat->avg_item_size = 0;
  }
  
+ /*
+  * GetBTRelationFreeSpace
+  *
+  * Get the free space for a btree index.
+  * This is a helper function for relation_free_space()
+  *
+  */
+ float4
+ GetBTRelationFreeSpace(Relation rel)
+ {
+ 	BTPageStat stat;
+ 
+ 	Buffer		buffer;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+ BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Skip page 0 because it is a metapage */
+ 	for (blkno = 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		/* 
+ 		 * get the statistics of the indexes and use that info
+ 		 * to determine free space on the page
+ 		 */
+ 		GetBTPageStatistics(blkno, buffer, &stat);
+ 		/* 
+ 		 * Consider pages DELETED and HALF_DEAD as empty,
+ 		 * besides those only consider LEAF pages
+ 		 */
+ 		if (stat.type == 'd' || stat.type == 'e')
+ 		{
+ 			free_space += stat.page_size;
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (stat.type == 'l')
+ 		{
+ 			free_space += stat.free_size;		
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ 
  /* ---
   * bt_page()
   *
diff --git a/contrib/pageinspect/btreefuncs.h b/contrib/pageinspect/btreefuncs.h
new file mode 100644
index ...549f878
*** a/contrib/pageinspect/btreefuncs.h
--- b/contrib/pageinspect/btreefuncs.h
***
*** 0 
--- 1,5 
+ /*
+  * contrib/pageinspect/btreefuncs.h
+  */
+ 
+ float4 GetBTRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
new file mode 100644
index 260ccff..8e3961d
*** a/contrib/pageinspect/heapfuncs.c
--- b/contrib/pageinspect/heapfunc

Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Merlin Moncure
On Tue, Jan 24, 2012 at 8:26 AM, Robert Haas  wrote:
> On Mon, Jan 23, 2012 at 5:49 PM, Merlin Moncure  wrote:
>> I'm not sure that you're getting anything with that user facing
>> complexity.  The only realistic case I can see for explicit control of
>> wire formats chosen is to defend your application from format changes
>> in the server when upgrading the server and/or libpq.   This isn't a
>> "let's get better compression problem", this is "I upgraded my
>> database and my application broke" problem.
>>
>> Fixing this problem in non documentation fashion is going to require a
>> full protocol change, period.
>
> Our current protocol allocates a 2-byte integer for the purposes of
> specifying the type of each parameter, and another 2-byte integer for
> the purpose of specifying the result type... but only one bit is
> really needed at present: text or binary.  If we revise the protocol
> version at some point, we might want to use some of that bit space to
> allow some more fine-grained negotiation of the protocol version.  So,
> for example, we might define the top 5 bits as reserved (always pass
> zero), the next bit as a text/binary flag, and the remaining 10 bits
> as a 10-bit "format version number".  When a change like this comes
> along, we can bump the highest binary format version recognized by the
> server, and clients who request the new version can get it.
>
> Alternatively, we might conclude that a 2-byte integer for each
> parameter is overkill and try to cut back... but the point is there's
> a bunch of unused bitspace there now.  In theory we could even do
> something like this without bumping the protocol version since the
> documentation seems clear that any value other than 0 and 1 yields
> undefined behavior, but in practice that seems like it might be a bit
> too edgy.

Yeah.  But again, this isn't a contract between libpq and the server,
but between the application and the server...unless you want libpq to
do format translation to something the application can understand (but
even then the application is still involved).  I'm not very
enthusiastic about encouraging libpq application authors to pass
format #defines for every single parameter and consumed datum to get
future proofing on wire formats.  So I'd vote against any format code
beyond the text/binary switch that currently exists (which, by the
way, while useful, is one of the great sins of libpq that we have to
deal with basically forever).  While wire formatting is granular down
to the type level, applications should not have to deal with that.
They should Just Work.  So who decides what format code to stuff into
the protocol?  Where are the codes defined?

I'm very much in the camp that sometime, presumably during connection
startup, the protocol accepts a non-#defined-in-libpq token (database
version?) from the application that describes to the server what wire
formats can be used and the server sends one back.  There probably has
to be some additional facilities for non-core types but let's put that
aside for the moment.  Those two tokens allow the server to pick the
highest supported wire format (text and binary!) that everybody
understands.  The server's token is useful if we're being fancy and we
want libpq to translate an older server's wire format to a newer one
for the application.  This of course means moving some of the type
system into the client, which is something we might not want to do
since among other things it puts a heavy burden on non-libpq driver
authors (but then again, they can always stay on the v3 protocol,
which can benefit from being frozen in terms of wire formats).

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multithread Query Planner

2012-01-24 Thread Robert Haas
On Mon, Jan 23, 2012 at 2:45 PM, Merlin Moncure  wrote:
> Yes, but OP is proposing to use multiple threads inside the forked
> execution process.  That's a completely different beast.  Many other
> databases support parallel execution of a single query and it might
> very well be better/easier to do that with threads.

I doubt it.  Almost nothing in the backend is thread-safe.  You can't
acquire a heavyweight lock, a lightweight lock, or a spinlock. You
can't do anything that might elog() or ereport().  None of those
things are reentrant.  Consequently, you can't do anything that
involves reading or pinning a buffer, making a syscache lookup, or
writing WAL.  You can't even  do something like parallelize the
qsort() of a chunk of data that's already been read into a private
buffer... because you'd have to call the comparison functions for the
data type, and they might elog() or ereport().  Of course, in certain
special cases (like int4) you could make it safe, but it's hard for to
imagine anyone wanting to go to that amount of effort for such a small
payoff.

If we're going to do parallel query in PG, and I think we are going to
need to do that eventually, we're going to need a system where large
chunks of work can be handed off, as in the oft-repeated example of
parallelizing an append node by executing multiple branches
concurrently.  That's where the big wins are.  And that means either
overhauling the entire backend to make it thread-safe, or using
multiple backends.  The latter will be hard, but it'll still be a lot
easier than the former.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Page Checksums

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat  wrote:
>> And yes, I would for sure turn such functionality on if it were present.
>>
>
> That's nice to say, but most people aren't willing to take a 50%
> performance hit. Not saying what we end up with will be that bad, but
> I've seen people get upset about performance hits much lower than
> that.

When we talk about a 50% hit, are we discussing (1) checksums that are
checked on each I/O, or (2) checksums that are checked each time we
re-pin a shared buffer?  The 50% hit was my estimate of (2) and has
not yet been measured, so shouldn't be used unqualified when
discussing checksums. The same is also true of "I would use it"
comments, since we're not sure whether you're voting for (1) or (2).

As to whether people will actually use (1), I have no clue. But I do
know that many people request that feature, including people who
run heavy-duty Postgres production systems and who also know about
filesystems. Do people need (2)? It's easy enough to add as an option,
once we have (1) and there is real interest.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)

2012-01-24 Thread Simon Riggs
On Mon, Jan 23, 2012 at 4:01 PM, Tom Lane  wrote:

> The real
> problem there is that BufFreelistLock is also used to protect the
> clock sweep pointer.

Agreed

> I think basically we gotta find a way to allow
> multiple backends to run clock sweeps concurrently.  Or else fix
> things so that the freelist never (well, hardly ever) runs dry.

Agreed.

The only question is what do we do now. I'm happy with the thought of
jam tomorrow, I'd like to see some minor improvements now, given that
when we say "we'll do that even better in the next release" it often
doesn't happen. Which is where those patches come in.

I've posted an improved lock wait analysis patch and have scheduled
some runs on heavily used systems to see what this tells us. Results
from live machines would also be greatly appreciated, since test
systems are unlikely to exercise realistic contention. Patch copied
here again.

This is a key issue since RAM is cheap enough now that servers are
awash in it, so large shared_buffers settings are very common.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)

2012-01-24 Thread Jeff Janes
On Mon, Jan 23, 2012 at 5:06 AM, Robert Haas  wrote:
> On Mon, Jan 23, 2012 at 12:12 AM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Sat, Jan 21, 2012 at 5:29 PM, Jim Nasby  wrote:
>>>> We should also look at having the freelist do something useful, instead of
>>>> just dropping it completely. Unfortunately that's probably more work...
>>
>>> That's kinda my feeling as well.  The free list in its current form is
>>> pretty much useless, but I don't think we'll save much by getting rid
>>> of it, because that's just a single test.  The expensive part of what
>>> we do while holding BufFreelistLock is, I think, iterating through
>>> buffers taking and releasing a spinlock on each one (!).
>>
>> Yeah ... spinlocks that, by definition, will be uncontested.
>
> What makes you think that they are uncontested?

It is automatically partitioned over 131,072 spinlocks (one per
buffer header) if shared_buffers=1GB, so the chances of contention
seem pretty low.  Even if a few very hot index root blocks are
heavily contested, they would only be encountered a few times out of
the 131,072.  I guess we could count them, similar to your
LWLOCK_STATS changes.

> Or for that matter,
> that even an uncontested spinlock operation is cheap enough to do
> while holding a badly contended LWLock?

Is the concern that taking an uncontested spinlock requires a lot of
instructions, or that the associated fences are expensive?

>
>> So I think
>> it would be advisable to prove rather than just assume that that's a
>> problem.
>
> It's pretty trivial to prove that there is a very serious problem with
> BufFreelistLock.  I'll admit I can't prove what the right fix is just
> yet, and certainly measurement is warranted.


From my own analysis and experiments, the mere act of getting the
BufFreelistLock when it is contended is far more important than
anything actually done while holding that lock.  When the clock-sweep
is done often enough that the lock is contended, then it is done often
enough to keep the average usagecount low and so the average number of
buffers visited during a clock sweep before finding a usable one was
around 2.0.  YMMV.

Can we get some people who run busy high-CPU servers that churn the
cache and think they may be getting hit with this problem post the
results of this query:

select usagecount, count(*) from pg_buffercache group by usagecount;


Cheers,

Jeff



Re: [HACKERS] Page Checksums

2012-01-24 Thread Robert Treat
On Tue, Jan 24, 2012 at 3:02 AM,   wrote:
>> * Robert Treat:
>>
>>> Would it be unfair to assert that people who want checksums but aren't
>>> willing to pay the cost of running a filesystem that provides
>>> checksums aren't going to be willing to make the cost/benefit trade
>>> off that will be asked for? Yes, it is unfair of course, but it's
>>> interesting how small the camp of those using checksummed filesystems
>>> is.
>>
>> Don't checksumming file systems currently come bundled with other
>> features you might not want (such as certain vendors)?
>
> I would chip in and say that I would prefer sticking to well-known proved
> filesystems like xfs/ext4 and let the application do the checksumming.
>

*shrug* You could use Illumos or BSD and you'd get generally
vendor-free systems using ZFS, which I'd say offers more well-known
and proven checksumming than anything cooking in Linux land, or than
the yet-to-be-written checksumming in Postgres.

> I don't foresee fully production-ready checksumming filesystems readily
> available in the standard Linux distributions in the near future.
>
> And yes, I would for sure turn such functionality on if it were present.
>

That's nice to say, but most people aren't willing to take a 50%
performance hit. Not saying what we end up with will be that bad, but
I've seen people get upset about performance hits much lower than
that.


Robert Treat
conjecture: xzilla.net
consulting: omniti.com



Re: Removing freelist (was Re: [HACKERS] Should I implement DROP INDEX CONCURRENTLY?)

2012-01-24 Thread Jeff Janes
On Sat, Jan 21, 2012 at 2:29 PM, Jim Nasby  wrote:
> On Jan 20, 2012, at 11:54 AM, Heikki Linnakangas wrote:

>> So, you're proposing that we remove freelist altogether? Sounds reasonable, 
>> but that needs to be performance tested somehow. I'm not sure what exactly 
>> the test should look like, but it probably should involve an OLTP workload, 
>> and large tables being created, populated to fill the cache with pages from 
>> the table, and dropped, while the OLTP workload is running. I'd imagine that 
>> the worst case scenario looks something like that.
>
> We should also look at having the freelist do something useful, instead of 
> just dropping it completely. Unfortunately that's probably more work...

If the head and tail are both protected by BufFreelistLock, I'm pretty
sure this will make things worse, not better.

If we change to having head and tail protected by separate spinlocks,
then I don't see how you can remove the last buffer from the list, or
add a buffer to an empty list, without causing havoc.

Does anyone have ideas for implementing a cross-platform, lock-free,
FIFO linked list?  Without that, I don't see how we are going to get
anywhere on this approach.

Cheers,

Jeff



Re: [HACKERS] Inline Extension

2012-01-24 Thread Robert Haas
On Mon, Jan 23, 2012 at 7:59 PM, Daniel Farina  wrote:
> Things are still a bit ugly in the more complex cases: consider
> PostGIS's linkage against libproj and other libraries.  In order to
> cover all cases, I feel that what I need is an optional hook (for the
> sake of argument, let's say it's another "command"-type hook, e.g.
> "archive_command") to be executed when extension (un)installation
> occurs on a primary or is replayed on a standby, whereby I can acquire
> the necessary dependencies for an extension or signal some kind of
> error (as to exactly how that interfaces with the server is delicate,
> should one want to supply good error messages to the user).

There aren't a whole lot of good options for handling errors on the
standby, in general.  If the operation has already occurred on the
master, it's too late to decide that we're not going to have it happen
on the standby as well.

Of course, if we confine ourselves to control and SQL files, then it
doesn't really matter: the catalog entries from the master will make
it to the standby regardless of the absence of the SQL and control
files, and maybe we could simply decide that the operations which
write in pg_extension aren't WAL-logged and can be separately
performed on the standby if you want them there as well.

But, that's a bit unlike how we normally do things.  And if we're
going to WAL-log the writing of the extension files, then I start to
feel like we should go back to putting the data in a system catalog
rather than the filesystem, because inventing a new WAL record type
for this seems like it will make things quite a bit more complicated
for no particular benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GUC_REPORT for protocol tunables was: Re: [HACKERS] Optimize binary serialization format of arrays with fixed size elements

2012-01-24 Thread Robert Haas
On Mon, Jan 23, 2012 at 5:49 PM, Merlin Moncure  wrote:
> I'm not sure that you're getting anything with that user facing
> complexity.  The only realistic case I can see for explicit control of
> wire formats chosen is to defend your application from format changes
> in the server when upgrading the server and/or libpq.   This isn't a
> "let's get better compression problem", this is "I upgraded my
> database and my application broke" problem.
>
> Fixing this problem in non documentation fashion is going to require a
> full protocol change, period.

Our current protocol allocates a 2-byte integer for the purposes of
specifying the type of each parameter, and another 2-byte integer for
the purpose of specifying the result type... but only one bit is
really needed at present: text or binary.  If we revise the protocol
version at some point, we might want to use some of that bit space to
allow some more fine-grained negotiation of the protocol version.  So,
for example, we might define the top 5 bits as reserved (always pass
zero), the next bit as a text/binary flag, and the remaining 10 bits
as a 10-bit "format version number".  When a change like this comes
along, we can bump the highest binary format version recognized by the
server, and clients who request the new version can get it.

Alternatively, we might conclude that a 2-byte integer for each
parameter is overkill and try to cut back... but the point is there's
a bunch of unused bitspace there now.  In theory we could even do
something like this without bumping the protocol version, since the
documentation seems clear that any value other than 0 and 1 yields
undefined behavior, but in practice that seems like it might be a bit
too edgy.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] basic pgbench runs with various performance-related patches

2012-01-24 Thread Robert Haas
On Tue, Jan 24, 2012 at 1:26 AM, Tatsuo Ishii  wrote:
>> ** pgbench, permanent tables, scale factor 100, 300 s **
>> 1 group-commit-2012-01-21 614.425851 -10.4%
>> 8 group-commit-2012-01-21 4705.129896 +6.3%
>> 16 group-commit-2012-01-21 7962.131701 +2.0%
>> 24 group-commit-2012-01-21 13074.939290 -1.5%
>> 32 group-commit-2012-01-21 12458.962510 +4.5%
>> 80 group-commit-2012-01-21 12907.062908 +2.8%
>
> Interesting. Comparing with this:
> http://archives.postgresql.org/pgsql-hackers/2012-01/msg00804.php
> you achieved very small enhancement. Do you think of any reason which
> makes the difference?

My test was run with synchronous_commit=off, so I didn't expect the
group commit patch to have much of an impact.  I included it mostly to
see whether by chance it helped anyway (since it also helps other WAL
flushes, not just commits) or whether it caused any regression.

One somewhat odd thing about these numbers is that, on permanent
tables, all of the patches seemed to show regressions vs. master in
single-client throughput.  That's a slightly difficult result to
believe, though, so it's probably a testing artifact of some kind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2012-01-24 Thread Alexander Korotkov
Hi!

New version of patch is attached.
On Thu, Dec 22, 2011 at 11:46 AM, Jeff Davis  wrote:

> A few comments:
>
> * In range_gist_picksplit, it would be nice to have a little bit more
> intuitive description of what's going on with the nonEmptyCount and
> nonInfCount numbers. For instance, it appears to depend on the fact that
> a range must either be in nonEmptyCount or in nonInfCount. Also, can you
> describe the reason you're multiplying by two and taking the absolute
> value? It seems to work, but I am missing the intuition behind those
> operations.
>
total_count - 2*nonEmptyCount = (total_count - nonEmptyCount) -
nonEmptyCount = emptyCount - nonEmptyCount
So it's really not obvious. I've simplified it.



> * The penalty function is fairly hard to read still. At a high level, I
> think we're trying to accomplish a few things (in order from most to
> least important):
>  - Keep normal ranges separate.
>  - Avoid broadening the class of the original predicate (e.g. turning
> single-infinite into double-infinite).
>  - Avoid broadening (as determined by subtype_diff) the original
> predicate.
>  - Favor adding ranges to narrower original predicates.
>
> Do you agree? If so, perhaps we can attack those more directly and it
> might be a little more readable.
>
> Additionally, the arbitrary numbers might become a problem. Can we
> choose better constants there? They would still be arbitrary when
> compared with real numbers derived from subtype_diff, but maybe we can
> still do better than what's there.
>
I've changed some comments and added constants for the penalty values.


> * Regarding the leftover "common" entries that can go to either side:
> what is the "delta" measure trying to accomplish? When I work through
> some examples, it seems to favor putting larger common ranges on the
> left (small delta) and smaller common ranges on the right (smaller
> delta). Why is that good? Or did I misread the code?
>
> Intuitively, I would think that we'd want to push the ranges with lower
> upper bounds to the left and higher lower bounds to the right -- in
> other words, recurse. Obviously, we'd need to make sure it terminated at
> some point, but splitting the common entries does seem like a smaller
> version of the original problem. Thoughts?
>
That was a bug. Actually, no "abs" is needed. Indeed, it doesn't affect
the result significantly.

-
With best regards,
Alexander Korotkov.


rangetypegist-0.6.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] New replication mode: write

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 10:47 AM, Fujii Masao  wrote:

>>> I'm afraid I could not understand your idea. Could you explain it in
>>> more detail?
>>
>> We either tell libpqwalreceiver about the latch, or we tell
>> walreceiver about the socket used by libpqwalreceiver.
>>
>> In either case we share a pointer from one module to another.
>
> The former seems difficult because it's not easy to link libpqwalreceiver.so
> to the latch. I will consider the latter.

Yes, it might be too hard, but lets look.

>>> You mean to change the meaning of apply_location? Currently it indicates
>>> the end + 1 of the last replayed WAL record, regardless of whether it's
>>> a commit record or not. So too many replies can be sent per incoming
>>> message because it might contain many WAL records. But you mean to
>>> change apply_location only when a commit record is replayed?
>>
>> There is no change to the meaning of apply_location. The only change
>> is that we send that message only when it has an updated value of
>> committed lsn.
>
> This means that apply_location might report a different location from
> pg_last_xlog_replay_location() on the standby, though in 9.1 they
> report the same one. That might confuse users. No?

The two values only match on a quiet system anyway, since both are
moving forwards.

They will still match on a quiet system.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Online base backup from the hot-standby

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 10:54 AM, Simon Riggs  wrote:
> On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao  wrote:
>
>>> I'll proceed to commit for this now.
>>
>> Thanks a lot!
>
> Can I just check a few things?

Just to clarify: I'm not expecting another patch version; just reply
here and I can edit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Online base backup from the hot-standby

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao  wrote:

>> I'll proceed to commit for this now.
>
> Thanks a lot!

Can I just check a few things?

You say
/*
+* Update full_page_writes in shared memory and write an
+* XLOG_FPW_CHANGE record before resource manager writes cleanup
+* WAL records or checkpoint record is written.
+*/

why does it need to be before the cleanup and checkpoint?

You say
/*
+* Currently only non-exclusive backup can be taken during recovery.
+*/

why?

You mention in the docs
"The backup history file is not created in the database cluster backed up."
but we need to explain the bad effect, if any.

You say
"If the standby is promoted to the master during online backup, the
backup fails."
but no explanation of why?

I could work those things out, but I don't want to have to, plus we
may disagree if I did.

There are some good explanations in comments of other things, just not
everywhere needed.

What happens if we shut down the WALwriter and then issue SIGHUP?

Are we sure we want to make the change of file format mandatory? That
means earlier versions of clients such as pg_basebackup will fail
against this version. Should we assume, if BACKUP FROM is missing,
that the backup was taken from the master?

There is nothing in the main docs to explain that the new feature is
available, or to explain its restrictions.
I expect you will add that later after commit.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] New replication mode: write

2012-01-24 Thread Fujii Masao
On Mon, Jan 23, 2012 at 9:53 PM, Simon Riggs  wrote:
> On Mon, Jan 23, 2012 at 10:03 AM, Fujii Masao  wrote:
>
>>>> To make the walreceiver call WaitLatchOrSocket(), we would need to
>>>> merge it and libpq_select() into one function. But the former is the
>>>> backend function and the latter is the frontend one. Now I have no
>>>> good idea to merge them cleanly.
>>>
>>> We can wait on the socket wherever it comes from. poll/select doesn't
>>> care how we got the socket.
>>>
>>> So we just need a common handler that calls either
>>> walreceiver/libpqwalreceiver function as required to handle the
>>> wakeup.
>>
>> I'm afraid I could not understand your idea. Could you explain it in
>> more detail?
>
> We either tell libpqwalreceiver about the latch, or we tell
> walreceiver about the socket used by libpqwalreceiver.
>
> In either case we share a pointer from one module to another.

The former seems difficult because it's not easy to link libpqwalreceiver.so
to the latch. I will consider the latter.

>>>> If we send back the reply as soon as the Apply pointer is changed, I'm
>>>> afraid quite a lot of reply messages are sent frequently, which might
>>>> cause a performance problem. This is also one of the reasons why I
>>>> didn't implement the quick-response feature. To address this problem,
>>>> we might need to change the master so that it sends the Wait pointer to
>>>> the standby, and change the standby so that it replies whenever the
>>>> Apply pointer catches up with the Wait one. This can reduce the number
>>>> of useless replies from the standby about the Apply pointer.
>>>
>>> We send back one reply per incoming message. The incoming messages
>>> don't know request state and checking that has a cost which I don't
>>> think is an appropriate payment since we only need this info when the
>>> link goes quiet.
>>>
>>> When the link goes quiet we still need to send replies if we have
>>> apply mode, but we only need to send apply messages if the lsn has
>>> changed because of a commit. That will considerably reduce the
>>> messages sent so I don't see a problem.
>>
>> You mean to change the meaning of apply_location? Currently it indicates
>> the end + 1 of the last replayed WAL record, regardless of whether it's
>> a commit record or not. So too many replies can be sent per incoming
>> message because it might contain many WAL records. But you mean to
>> change apply_location only when a commit record is replayed?
>
> There is no change to the meaning of apply_location. The only change
> is that we send that message only when it has an updated value of
> committed lsn.

This means that apply_location might report a different location from
pg_last_xlog_replay_location() on the standby, though in 9.1 they
report the same one. That might confuse users. No?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] WAL Restore process during recovery

2012-01-24 Thread Fujii Masao
On Tue, Jan 24, 2012 at 6:49 PM, Simon Riggs  wrote:
> On Tue, Jan 24, 2012 at 9:43 AM, Fujii Masao  wrote:
>> On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs  wrote:
>>> On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao  wrote:
>>>> Why does walrestore need to be invoked even when restore_command is
>>>> not specified? It seems to be useless. We invoke walreceiver only when
>>>> primary_conninfo is specified now. Similarly we should invoke walrestore
>>>> only when restore_command is specified?
>>>
>>> walreceiver is shutdown and restarted in case of failed connection.
>>> That never happens with walrestore because the command is run each
>>> time - when we issue system(3) a new process is forked to run the
>>> command. So there is no specific cleanup to perform and so no reason
>>> for a managed cleanup process.
>>>
>>> So I can't see a specific reason to change that. Do you think it makes
>>> a difference?
>>
>> Yes. When restore_command is not specified in recovery.conf, walrestore
>> process doesn't do any useful activity and just wastes CPU cycle. Which
>> might be harmless for a functionality of recovery, but ISTM it's better not
>> to start up walrestore in that case to avoid the waste of cycle.
>
> It just sleeps on a latch when it has nothing to do, so no wasted cycles.

Right; since the walrestore process wakes up only every 10 seconds, the
wasted cycles are minimal. But what makes me uncomfortable is that the
walrestore process has nothing to do *from start to end* when
restore_command is not specified, yet it's started anyway. I guess many
people would be surprised at that. Of course, if restore_command could
be changed without restarting the server, I would agree with you,
because the walrestore process might do useful work later. But
currently it can't.

> Asking the postmaster seemed the easier option, I guess I could have
> chosen the other way also.
>
> I'll look at this when this is the last thing left to resolve to see
> if that improves things.

Okay.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Online base backup from the hot-standby

2012-01-24 Thread Fujii Masao
On Mon, Jan 23, 2012 at 10:11 PM, Simon Riggs  wrote:
> On Mon, Jan 23, 2012 at 10:29 AM, Fujii Masao  wrote:
>> On Fri, Jan 20, 2012 at 11:34 PM, Simon Riggs  wrote:
>>> On Fri, Jan 20, 2012 at 12:54 PM, Fujii Masao  wrote:
>>>> Thanks for the review!
>>>>
>>>> On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs wrote:
>>>>> I'm looking at this patch and wondering why we're doing so many
>>>>> press-ups to ensure full_page_writes parameter is on. This will still
>>>>> fail if you use a utility that removes the full page writes, but fail
>>>>> silently.
>>>>>
>>>>> I think it would be beneficial to explicitly check that all WAL
>>>>> records have full page writes actually attached to them until we
>>>>> achieve consistency.
>>>>
>>>> I agree that it's worth adding such a safeguard. That can be a
>>>> self-contained feature, so I'll submit a separate patch for that, to
>>>> keep each patch small.
>>>
>>> Maybe, but you mean do this now as well? Not sure I like silent errors.
>>
>> If many people think the patch is not acceptable without such a safeguard,
>> I will do that right now. Otherwise, I'd like to take more time to do
>> that, i.e.,
>> add it to 9.2dev Open Items.
>
>> I've not come up with good idea. Ugly idea is to keep track of all replays of
>> full_page_writes for every buffer pages (i.e., prepare 1-bit per buffer page
>> table and set the specified bit to 1 when full_page_writes is applied),
>> and then check whether full_page_writes has been already applied when
>> replaying normal WAL record... Do you have any better idea?
>
> Not sure.
>
> I think the only possible bug here is one introduced by an outside utility.
>
> In that case, I don't think it should be the job of the backend to go
> too far to protect against such atypical error. So if we can't solve
> it fairly easily and with no overhead then I'd say lets skip it. We
> could easily introduce a bug here just by having faulty checking code.
>
> So lets add it to 9.2 open items as a non-priority item.

Agreed.

> I'll proceed to commit for this now.

Thanks a lot!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] WAL Restore process during recovery

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 9:43 AM, Fujii Masao  wrote:
> On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs  wrote:
>> On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao  wrote:
>>> Why does walrestore need to be invoked even when restore_command is
>>> not specified? It seems to be useless. We invoke walreceiver only when
>>> primary_conninfo is specified now. Similarly we should invoke walrestore
>>> only when restore_command is specified?
>>
>> walreceiver is shutdown and restarted in case of failed connection.
>> That never happens with walrestore because the command is run each
>> time - when we issue system(3) a new process is forked to run the
>> command. So there is no specific cleanup to perform and so no reason
>> for a managed cleanup process.
>>
>> So I can't see a specific reason to change that. Do you think it makes
>> a difference?
>
> Yes. When restore_command is not specified in recovery.conf, the walrestore
> process doesn't do any useful work and just wastes CPU cycles. That
> might be harmless for the functionality of recovery, but ISTM it's better not
> to start walrestore in that case, to avoid wasting those cycles.

It just sleeps on a latch when it has nothing to do, so no wasted cycles.

Asking the postmaster seemed the easier option; I guess I could have
chosen the other way as well.

I'll look at this once it's the last thing left to resolve, to see
if that improves things.


>> Cleaned up the points noted, new patch attached in case you wish to
>> review further.
>>
>> Still has a bug, so it's still with me to fix.
>
> Thanks! Will review further.

Much appreciated.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] WAL Restore process during recovery

2012-01-24 Thread Fujii Masao
On Tue, Jan 24, 2012 at 12:23 AM, Simon Riggs  wrote:
> On Mon, Jan 23, 2012 at 12:23 PM, Fujii Masao  wrote:
>> Why does walrestore need to be invoked even when restore_command is
>> not specified? It seems to be useless. We invoke walreceiver only when
>> primary_conninfo is specified now. Similarly we should invoke walrestore
>> only when restore_command is specified?
>
> walreceiver is shutdown and restarted in case of failed connection.
> That never happens with walrestore because the command is run each
> time - when we issue system(3) a new process is forked to run the
> command. So there is no specific cleanup to perform and so no reason
> for a managed cleanup process.
>
> So I can't see a specific reason to change that. Do you think it makes
> a difference?

Yes. When restore_command is not specified in recovery.conf, the walrestore
process doesn't do any useful work and just wastes CPU cycles. That
might be harmless for the functionality of recovery, but ISTM it's better not
to start walrestore in that case, to avoid wasting those cycles.

> Cleaned up the points noted, new patch attached in case you wish to
> review further.
>
> Still has a bug, so it's still with me to fix.

Thanks! Will review further.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Page Checksums

2012-01-24 Thread Florian Weimer
> I would chip in and say that I would prefer sticking to well-known, proven
> filesystems like xfs/ext4 and letting the application do the checksumming.

Yes, that's a different way of putting my concern.  If you want a proven
file system with checksumming (and an fsck), options are really quite
limited.

> And yes, I would for sure turn such functionality on if it were present.

Same here.  I already use page-level checksums with Berkeley DB.

-- 
Florian Weimer
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [HACKERS] Page Checksums

2012-01-24 Thread jesper
> * Robert Treat:
>
>> Would it be unfair to assert that people who want checksums but aren't
>> willing to pay the cost of running a filesystem that provides
>> checksums aren't going to be willing to make the cost/benefit trade
>> off that will be asked for? Yes, it is unfair of course, but it's
>> interesting how small the camp of those using checksummed filesystems
>> is.
>
> Don't checksumming file systems currently come bundled with other
> features you might not want (such as certain vendors)?

I would chip in and say that I would prefer sticking to well-known, proven
filesystems like xfs/ext4 and letting the application do the checksumming.

I don't foresee fully production-ready checksumming filesystems being readily
available in the standard Linux distributions in the near future.

And yes, I would for sure turn such functionality on if it were present.

-- 
Jesper

