[HACKERS] MVCC catalog access

2013-05-21 Thread Robert Haas
We've had a number of discussions about the evils of SnapshotNow.  As
far as I can tell, nobody likes it and everybody wants it gone, but
there is concern about the performance impact.  I decided to do some
testing to measure the impact.  I was pleasantly surprised by the
results.

The attached patch is a quick hack to provide for MVCC catalog access.
 It adds a GUC called "mvcc_catalog_access".  When this GUC is set to
true, and heap_beginscan() or index_beginscan() is called with
SnapshotNow, they call GetLatestSnapshot() and use the resulting
snapshot in lieu of SnapshotNow.  As a debugging double-check, I
modified HeapTupleSatisfiesNow to elog(FATAL) if called while
mvcc_catalog_access is true.  Of course, both of these are dirty
hacks.  If we were actually to implement MVCC catalog access, I think
we'd probably just go through and start replacing instances of
SnapshotNow with GetLatestSnapshot(), but I wanted to make it easy to
do perf testing.
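
To make that concrete, the substitution amounts to roughly this (an
illustrative sketch only, not the attached patch; it assumes a boolean GUC
variable named mvcc_catalog_access and uses a made-up helper):

    /* Hypothetical helper: choose the snapshot a catalog scan should use. */
    static Snapshot
    catalog_scan_snapshot(Snapshot snapshot)
    {
        /* Swap SnapshotNow for a fresh MVCC snapshot when the GUC is on. */
        if (mvcc_catalog_access && snapshot == SnapshotNow)
            return GetLatestSnapshot();
        return snapshot;
    }

heap_beginscan() and index_beginscan() would run their incoming snapshot
through something like this before stashing it in the scan descriptor.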

When I first made this change, I couldn't detect any real difference;
indeed, it seemed that make check was running ever-so-slightly faster
than before, although that may well have been a testing artifact.  I
wrote a test case that created a schema with 100,000 functions in it
and then dropped the schema (I believe it was Tom who previously
suggested this test case as a worst-case scenario for MVCC catalog
access).  That didn't seem to be adversely affected either, even
though it must take ~700k additional MVCC snapshots with
mvcc_catalog_access = true.

MVCC Off: Create 8743.101 ms, Drop 9655.471 ms
MVCC On: Create 7462.882 ms, Drop 9515.537 ms
MVCC Off: Create 7519.160 ms, Drop 9380.905 ms
MVCC On: Create 7517.382 ms, Drop 9394.857 ms

The first "Create" seems to be artificially slow because of some kind
of backend startup overhead.  Not sure exactly what.

After wracking my brain for a few minutes, I realized that the lack of
any apparent performance regression was probably due to the lack of
any concurrent connections, making the scans of the PGXACT array very
cheap.  So I wrote a little program to open a bunch of extra
connections.  My MacBook Pro grumbled when I tried to open more than
about 600, so I had to settle for that number.  That was enough to
show up the cost of all those extra snapshots:

MVCC Off: Create 9065.887 ms, Drop 9599.494 ms
MVCC On: Create 8384.065 ms, Drop 10532.909 ms
MVCC Off: Create 7632.197 ms, Drop 9499.502 ms
MVCC On: Create 8215.443 ms, Drop 10033.499 ms

Now, I don't know about you, but I'm having a hard time getting
agitated about those numbers.  Most people are not going to drop
100,000 objects with a single cascaded drop.  And most people are not
going to have 600 connections open when they do.  (The snapshot
overhead should be roughly proportional to the product of the number
of drops and the number of open connections, and the number of cases
where the product is as high as 60 million has got to be pretty
small.)  But suppose that someone is in that situation.  Well, then
they will take a... 10% performance penalty?  That sounds plenty
tolerable to me, if it means we can start moving in the direction of
allowing some concurrent DDL.

Am I missing an important test case here?  Are these results worse
than I think they are?  Did I boot this testing somehow?

[MVCC catalog access patch, test program to create lots of idle
connections, and pg_depend stress test case attached.]

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


mvcc-catalog-access.patch
Description: Binary data
/*
 * pg_cxn.c
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <getopt.h>
#include "libpq-fe.h"

static void pg_connect(const char *conninfo);

int
main(int argc, char **argv)
{
	int		c;
	int		n = 1;
	int		optindex;
	int		i;
	const char *conninfo;

	while ((c = getopt_long(argc, argv, "n:", NULL, &optindex)) != -1)
	{
		switch (c)
		{
			case 'n':
				n = atoi(optarg);
				break;
			default:
				fprintf(stderr, "Usage: pg_cxn [-n count] connstr\n");
				exit(1);
		}
	}
	argv += optind;
	argc -= optind;

	if (argc > 0)
		conninfo = argv[0];
	else
		conninfo = "";

	for (i = 0; i < n; ++i)
		pg_connect(conninfo);

	while (1)
		sleep(3600);

	return 0;
}

static void
pg_connect(const char *conninfo)
{
	PGconn	   *conn;

	/* Make a connection to the database */
	conn = PQconnectdb(conninfo);

	/* Check to see that the backend connection was successfully made */
	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		exit(1);
	}
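
	/*
	 * Note that the connection is intentionally left open; each one sits
	 * idle until the process exits, which is the whole point here.
	 */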
}
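
(For reference, the test program above just needs libpq: compile it with
the include and library paths reported by pg_config plus -lpq, then run
something like ./pg_cxn -n 600 'connstring' to hold 600 idle connections
open.)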


depend
Description: Binary data



Re: [HACKERS] plperl segfault in plperl_trusted_init() on kfreebsd

2013-05-21 Thread Christoph Berg
Re: Andrew Dunstan 2013-05-17 <51964770.6070...@dunslane.net>
> I have reproduced this. It happens with both the distro perl and a
> home-built perl 5.14. AFAICT this is a Perl bug. Any reference at
> all to ERRSV at the point this occurs causes a core dump, even just
> assigning it to a local SV * variable. Maybe you should take this up
> with the Perl people. It certainly seems mysterious that this should
> only occur on this platform.

I've just tried to look into this but got lost in chasing about 5
nested layers of ERRSV #defines. :-/

The crash also happens with libperl5.18 (5.18.0-1) on unstable/kfreebsd-amd64.

./configure --with-perl; make; cd src/pl/plperl; make clean

postgresql-9.3/src/pl/plperl $ make PROFILE="-g -O0"
'/usr/bin/perl' ./text2macro.pl --strip='^(\#.*|\s*)$' plc_perlboot.pl 
plc_trusted.pl > perlchunks.h
'/usr/bin/perl' plperl_opmask.pl plperl_opmask.h
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g 
-O0 -fpic -I. -I. -I../../../src/include -D_GNU_SOURCE  
-I/usr/lib/perl/5.18/CORE  -c -o plperl.o plperl.c
'/usr/bin/perl' /usr/share/perl/5.18/ExtUtils/xsubpp -typemap 
/usr/share/perl/5.18/ExtUtils/typemap SPI.xs >SPI.c
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g 
-O0 -fpic -I. -I. -I../../../src/include -D_GNU_SOURCE  
-I/usr/lib/perl/5.18/CORE  -c -o SPI.o SPI.c
'/usr/bin/perl' /usr/share/perl/5.18/ExtUtils/xsubpp -typemap 
/usr/share/perl/5.18/ExtUtils/typemap Util.xs >Util.c
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g 
-O0 -fpic -I. -I. -I../../../src/include -D_GNU_SOURCE  
-I/usr/lib/perl/5.18/CORE  -c -o Util.o Util.c
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g 
-O0 -fpic -shared -o plperl.so plperl.o SPI.o Util.o -L../../../src/port 
-L../../../src/common -Wl,--as-needed 
-Wl,-rpath,'/usr/lib/perl/5.18/CORE',--enable-new-dtags -g -O0  
-fstack-protector -L/usr/local/lib  -L/usr/lib/perl/5.18/CORE -lperl -ldl -lm 
-lpthread -lc -lcrypt 

postgresql-9.3/src/pl/plperl $ make check
[...]
../../../src/test/regress/pg_regress --inputdir=. --temp-install=./tmp_check 
--top-builddir=../../..   --dbname=pl_regression --load-extension=plperl  
--load-extension=plperlu plperl plperl_lc plperl_trigger plperl_shared 
plperl_elog plperl_util plperl_init plperlu plperl_array plperl_plperlu
[...]
== running regression test queries==
test plperl   ... FAILED (test process exited with exit code 2)
[...]

#0  0x00080a4274ed in plperl_trusted_init () at plperl.c:937
937 if (SvTRUE(ERRSV))
#1  0x00080a42602d in select_perl_context (trusted=1 '\001') at plperl.c:600
#2  0x00080a430cb1 in compile_plperl_function (fn_oid=49152, is_trigger=0 '\000') at plperl.c:2661
#3  0x00080a42ca4d in plperl_validator (fcinfo=0x7fffbeb0) at plperl.c:1900
#4  0x0071ffb6 in OidFunctionCall1Coll ()
#5  0x004ec3b7 in ProcedureCreate ()
#6  0x0054c4ee in CreateFunction ()
#7  0x0065889f in ProcessUtilitySlow.isra.2 ()
#8  0x00657b4c in standard_ProcessUtility ()
#9  0x00655467 in PortalRunUtility ()
#10 0x00656055 in PortalRunMulti ()
#11 0x00656b42 in PortalRun ()
#12 0x00653ef3 in PostgresMain ()
#13 0x00464471 in ServerLoop ()
#14 0x006144e1 in PostmasterMain ()
#15 0x00464cab in main ()

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] SET work_mem = '1TB';

2013-05-21 Thread Gavin Flower

On 22/05/13 09:13, Simon Riggs wrote:

I worked up a small patch to support Terabyte setting for memory.
Which is OK, but it only works for 1TB, not for 2TB or above.

Which highlights that since we measure things in kB, we have an
inherent limit of 2047GB for our memory settings. It isn't beyond
belief that we'll want to go that high; or at least it won't be beyond
belief by the end of 2014, and it will be annoying sometime before 2020.

Solution seems to be to support something potentially bigger than INT
for GUCs. So we can reclassify GUC_UNIT_MEMORY according to the
platform we're on.

Opinions?

--
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

I suspect it should be fixed before it starts being a problem, for 2 
reasons:


1. best to panic early while we have time
   (or more prosaically: doing it soon gives us more time to get it
   right without undue pressure)

2. not being able to cope with 2TB and above might put off companies with
   seriously massive databases from moving to Postgres

It's probably a good idea to check what other values should be increased as well.


Cheers,
Gavin


[HACKERS] SET work_mem = '1TB';

2013-05-21 Thread Simon Riggs
I worked up a small patch to support Terabyte setting for memory.
Which is OK, but it only works for 1TB, not for 2TB or above.

Which highlights that since we measure things in kB, we have an
inherent limit of 2047GB for our memory settings. It isn't beyond
belief that we'll want to go that high; or at least it won't be beyond
belief by the end of 2014, and it will be annoying sometime before 2020.
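
For reference, that ceiling is just INT_MAX expressed in kB (a quick
illustrative check, not part of the patch):

    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        long long max_kb = INT_MAX;     /* kB-unit GUCs are plain ints */
        printf("%lld kB = %lld GB\n", max_kb, max_kb / (1024 * 1024));
        /* prints: 2147483647 kB = 2047 GB, i.e. just under 2TB */
        return 0;
    }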

Solution seems to be to support something potentially bigger than INT
for GUCs. So we can reclassify GUC_UNIT_MEMORY according to the
platform we're on.

Opinions?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


terabyte_work_mem.v1.patch
Description: Binary data



Re: [HACKERS] Fast promotion failure

2013-05-21 Thread Simon Riggs
On 21 May 2013 09:26, Simon Riggs  wrote:

> I'm OK with that principle...

Well, after fighting some more with that, I've gone with the, er,
principle of slightly less ugliness.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] pg_export_snapshot on standby side

2013-05-21 Thread Fujii Masao
Hi,

We cannot run parallel pg_dump on the standby server because
pg_export_snapshot() always fails on the standby. Is this an oversight
in parallel pg_dump or in pg_export_snapshot()?

pg_export_snapshot() fails on the standby because it always assigns a
new XID, which is not allowed on the standby. Do we really need to
assign a new XID even on the standby for an exportable snapshot?

Regards,

-- 
Fujii Masao




Re: [HACKERS] streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

2013-05-21 Thread Benedikt Grundmann
We are seeing these errors on a regular basis on the testing box now.  We
have even changed the backup script to shut down the hot standby, take the
LVM snapshot, restart the hot standby, and rsync the LVM snapshot.  It
still happens.

We have never seen this before we introduced the hot standby.  So we will
now revert to taking the backups from LVM snapshots on the production
database.  If you have ideas about what else we should try or what
information we can give you to debug this, let us know and we will try to
do so.

Until then we will sadly operate on the assumption that the combination of
hot standby and "frozen snapshot" backup of it is not production ready.

Thanks,

Bene




On Thu, May 16, 2013 at 8:10 AM, David Powers wrote:

> I'll try to get the primary upgraded over the weekend when we can afford a
> restart.
>
> In the meantime I have a single test showing that a shutdown, snapshot,
> restart produces a backup that passes the vacuum analyze test.  I'm going
> to run a full vacuum today.
>
> -David
>
>
> On Wed, May 15, 2013 at 3:53 PM, Heikki Linnakangas <
> hlinnakan...@vmware.com> wrote:
>
>> On 15.05.2013 22:50, Benedikt Grundmann wrote:
>>
>>> On Wed, May 15, 2013 at 2:50 PM, Heikki Linnakangas
>>> <hlinnakan...@vmware.com> wrote:
 The subject says 9.2.3. Are you sure you're running 9.2.4 on all the

 servers? There was a fix to a bug related to starting a standby server
 from
 a filesystem snapshot. I don't think it was quite the case you have, but
 pretty close.

>>>
>>> So this is delightfully embarrassing; I just went back to double-check and
>>>
>>> - primary box is 9.2.3
>>> - standby is 9.2.4
>>> - testing is 9.2.4
>>>
>>> I guess that alone could possibly explain it?
>>>
>>
>> Hmm, no, it should still work. There haven't been any changes in the WAL
>> format. I do recommend upgrading the primary, of course, but I don't really
>> see how that would explain what you're seeing.
>>
>> - Heikki
>>
>
>


Re: [HACKERS] fast promotion and log_checkpoints

2013-05-21 Thread Simon Riggs
On 21 May 2013 15:29, Fujii Masao  wrote:

> Or, what about using CHECKPOINT_FORCE and just printing "force"?
> Currently that checkpoint always starts because of the existence of the
> end-of-recovery record, but I think we should ensure that the checkpoint
> always starts by using that flag.

This would mean we can't use the secondary checkpoint record, but we
already gave that up, so that should be OK.

Three people, three suggestions; I will agree to this one so we can get
on with it.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] fast promotion and log_checkpoints

2013-05-21 Thread Fujii Masao
On Tue, May 21, 2013 at 4:44 AM, Simon Riggs  wrote:
> On 20 May 2013 20:06, Heikki Linnakangas  wrote:
>
>>> It would be possible to redesign this with a special new reason, or we
>>> could just use "time" as the reason, or we could just leave it.
>>>
>>> Do nothing is easy, though so are the others, so we can choose
>>> anything we want. What do we want it to say?
>>
>>
>> I'm not sure. Perhaps we should print "(no flags)", so that it wouldn't look
>> like there's something missing in the log message.
>
> The reason text would still be absent, so it wouldn't really help the
> user interpret the log message correctly.
>
> I suggest we use RequestCheckpoint(CHECKPOINT_CAUSE_TIME) instead,
> since it is literally time for a checkpoint.

Or, what about using CHECKPOINT_FORCE and just printing "force"?
Currently that checkpoint always starts because of the existence of the
end-of-recovery record, but I think we should ensure that the checkpoint
always starts by using that flag.
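
Concretely, the suggestion amounts to something like this (an illustrative
sketch only; "other_flags" stands in for whatever flags the end-of-recovery
checkpoint request already passes):

    /* Adding CHECKPOINT_FORCE guarantees the checkpoint runs and makes
     * log_checkpoints report "force" as the trigger. */
    RequestCheckpoint(other_flags | CHECKPOINT_FORCE);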

Regards,

-- 
Fujii Masao




Re: [HACKERS] pgbench vs. SERIALIZABLE

2013-05-21 Thread Josh Berkus

> Presumably we would want to repeat all of the ordinary commands in the
> file, but not any of the backslash set commands that precede any ordinary
> commands.  But what if backslash set commands are sprinkled between
> ordinary commands?

See, this is why I had no intention of retrying.  Since people can run
custom workloads, there's no good way for us to handle the different
permutations of what kinds of scripts people might write.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Move unused buffers to freelist

2013-05-21 Thread Amit Kapila
On Tuesday, May 21, 2013 12:36 PM Amit Kapila wrote:
> On Monday, May 20, 2013 6:54 PM Robert Haas wrote:
> > On Thu, May 16, 2013 at 10:18 AM, Amit Kapila
> 
> > wrote:
> > > Further Performance Data:
> > >
> > > Below data is for average 3 runs of 20 minutes
> > >
> > > Scale Factor   - 1200
> > > Shared Buffers - 7G
> >
> > These results are good but I don't get similar results in my own
> > testing.
> 
> Thanks for running detailed tests
> 
> > I ran pgbench tests at a variety of client counts and scale
> > factors, using 30-minute test runs and the following non-default
> > configuration parameters.
> >
> > shared_buffers = 8GB
> > maintenance_work_mem = 1GB
> > synchronous_commit = off
> > checkpoint_segments = 300
> > checkpoint_timeout = 15min
> > checkpoint_completion_target = 0.9
> > log_line_prefix = '%t [%p] '
> >
> > Here are the results.  The first field in each line is the number of
> > clients. The second number is the scale factor.  The numbers after
> > "master" and "patched" are the median of three runs.
> >
> > 01 100 master 1433.297699 patched 1420.306088
> > 01 300 master 1371.286876 patched 1368.910732
> > 01 1000 master 1056.891901 patched 1067.341658
> > 01 3000 master 637.312651 patched 685.205011
> > 08 100 master 10575.017704 patched 11456.043638
> > 08 300 master 9262.601107 patched 9120.925071
> > 08 1000 master 1721.807658 patched 1800.733257
> > 08 3000 master 819.694049 patched 854.333830
> > 32 100 master 26981.677368 patched 27024.507600
> > 32 300 master 14554.870871 patched 14778.285400
> > 32 1000 master 1941.733251 patched 1990.248137
> > 32 3000 master 846.654654 patched 892.554222
> 
> 
> Is the above test for TPC-B?
> In the above tests, there is a performance increase of 1~8% and a
> decrease of 0.2~1.5%.
> 
> > And here's the same results for 5-minute, read-only tests:
> >
> > 01 100 master 9361.073952 patched 9049.553997
> > 01 300 master 8640.235680 patched 8646.590739
> > 01 1000 master 8339.364026 patched 8342.799468
> > 01 3000 master 7968.428287 patched 7882.121547
> > 08 100 master 71311.491773 patched 71812.899492
> > 08 300 master 69238.839225 patched 70063.632081
> > 08 1000 master 34794.778567 patched 65998.468775
> > 08 3000 master 60834.509571 patched 61165.998080
> > 32 100 master 203168.264456 patched 205258.283852
> > 32 300 master 199137.276025 patched 200391.633074
> > 32 1000 master 177996.853496 patched 176365.732087
> > 32 3000 master 149891.147442 patched 148683.269107
> >
> > Something appears to have screwed up my results for 8 clients @ scale
> > factor 300 on master,
> 
>   Do you mean the reading for scale factor 1000?
> 
> >but overall, on both the read-only and
> > read-write tests, I'm not seeing anything that resembles the big
> gains
> > you reported.
> 
> I have not generated numbers for read-write tests; I will check that.
> For read-only tests, the performance increase is minor and different
> from what I saw.
> A few points I can think of that might explain the difference in data:
> 
> 1. In my tests I always observed the best numbers when the number of
> clients/threads equals the number of cores, which in your case would be 16.
> 2. I think for scale factors 100 and 300 there should not be much
> performance increase, as they should mostly get buffers from the
> freelist whether or not the bgwriter adds to it.
> 3. In my tests the variable is shared_buffers; the database size is
> always less than RAM (scale factor 1200, approx. db size 16~17GB, RAM
> 24GB), but varying shared_buffers can lead to I/O.
> 4. Each run is 20 minutes; not sure if this makes any difference.
> 
> > Tests were run on a 16-core, 64-hwthread PPC64 machine provided to
> the
> > PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.
> 
> To understand the difference between your runs and mine, could you
> please tell me about the points below:
> 1. How much RAM does the machine have?
> 2. Is the number of threads equal to the number of clients?
> 3. Before starting tests I have always pre-warmed the buffers (using
> the pg_prewarm you wrote last year); is that the same for the above
> read-only tests?
> 4. Can you please run once more only the test where you saw the
> variation (8 clients @ scale factor 1000 on master), because I have
> also seen that the performance difference is very good for certain
> configurations (scale factor, RAM, shared buffers).

On looking more closely at the data you posted, I believe there is some
problem with the reading for 8 clients @ scale factor 1000 on master, as
in all other cases the data for scale factor 1000 is better than for 3000.
So I think there is no need to run it again.

> Apart from the above, I had one more observation during my investigation
> into why there is a small dip in some cases:
> 1. Many times it finds that a buffer on the freelist is not usable,
> i.e. its refcount or usage count is not zero, and so it has to spend
> more time under BufFreelistLock.
>    I have not run any further experiments related to this finding, e.g.
> whether it really adds any overhead.

Re: [HACKERS] Fast promotion failure

2013-05-21 Thread Simon Riggs
On 21 May 2013 07:46, Heikki Linnakangas  wrote:
> On 21.05.2013 00:00, Simon Riggs wrote:
>>
>> When we set the new timeline we should be updating all values that
>> might be used elsewhere. If we do that, then no matter when or how we
>> run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.
>>
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -5838,8 +5838,16 @@ StartupXLOG(void)
>> }
>>
>> /* Save the selected TimeLineID in shared memory, too */
>> -   XLogCtl->ThisTimeLineID = ThisTimeLineID;
>> -   XLogCtl->PrevTimeLineID = PrevTimeLineID;
>> +   {
>> +   /* use volatile pointer to prevent code rearrangement */
>> +   volatile XLogCtlData *xlogctl = XLogCtl;
>> +
>> +   SpinLockAcquire(&xlogctl->info_lck);
>> +   XLogCtl->ThisTimeLineID = ThisTimeLineID;
>> +   XLogCtl->lastReplayedTLI = ThisTimeLineID;
>> +   XLogCtl->PrevTimeLineID = PrevTimeLineID;
>> +   SpinLockRelease(&xlogctl->info_lck);
>> +   }
>
>
> Hmm. lastReplayedTLI is supposed to be the timeline of the last record that
> was replayed, i.e. the timeline corresponding to lastReplayedEndRecPtr. I think
> it would work, but it feels like it could be an easy source of bugs in the
> future.

I'm OK with that principle, as long as we don't touch ThisTimeLineID,
which has been the source of multiple bugs.

So we should set the TLI explicitly before installing it, as in the attached patch.

Otherwise we'd need multiple permanent TLIs which would be overkill.

I feel there are problems because we set the newly selected TLI from the
startup process into shared memory, and then some time later we set
SharedRecoveryInProgress = false. That timing window isn't good, but I
don't see a different way.

> It might be a good idea to change walsender to not store that in
> ThisTimeLineID, though. It used to make sense for ThisTimeLineID to track
> the current recovery timeline in 9.2 and below, but now that walsender can
> be sending any older timeline, using ThisTimeLineID to store the latest one
> seems confusing.

Agreed, but it looks like too much code to touch that lightly.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


install_xlog_right.v1.patch
Description: Binary data



Re: [HACKERS] Move unused buffers to freelist

2013-05-21 Thread Amit Kapila
On Monday, May 20, 2013 6:54 PM Robert Haas wrote:
> On Thu, May 16, 2013 at 10:18 AM, Amit Kapila 
> wrote:
> > Further Performance Data:
> >
> > Below data is for average 3 runs of 20 minutes
> >
> > Scale Factor   - 1200
> > Shared Buffers - 7G
> 
> These results are good but I don't get similar results in my own
> testing.  

Thanks for running detailed tests

> I ran pgbench tests at a variety of client counts and scale
> factors, using 30-minute test runs and the following non-default
> configuration parameters.
> 
> shared_buffers = 8GB
> maintenance_work_mem = 1GB
> synchronous_commit = off
> checkpoint_segments = 300
> checkpoint_timeout = 15min
> checkpoint_completion_target = 0.9
> log_line_prefix = '%t [%p] '
> 
> Here are the results.  The first field in each line is the number of
> clients. The second number is the scale factor.  The numbers after
> "master" and "patched" are the median of three runs.
> 
> 01 100 master 1433.297699 patched 1420.306088
> 01 300 master 1371.286876 patched 1368.910732
> 01 1000 master 1056.891901 patched 1067.341658
> 01 3000 master 637.312651 patched 685.205011
> 08 100 master 10575.017704 patched 11456.043638
> 08 300 master 9262.601107 patched 9120.925071
> 08 1000 master 1721.807658 patched 1800.733257
> 08 3000 master 819.694049 patched 854.333830
> 32 100 master 26981.677368 patched 27024.507600
> 32 300 master 14554.870871 patched 14778.285400
> 32 1000 master 1941.733251 patched 1990.248137
> 32 3000 master 846.654654 patched 892.554222


Is the above test for TPC-B?
In the above tests, there is a performance increase of 1~8% and a decrease
of 0.2~1.5%.

> And here's the same results for 5-minute, read-only tests:
> 
> 01 100 master 9361.073952 patched 9049.553997
> 01 300 master 8640.235680 patched 8646.590739
> 01 1000 master 8339.364026 patched 8342.799468
> 01 3000 master 7968.428287 patched 7882.121547
> 08 100 master 71311.491773 patched 71812.899492
> 08 300 master 69238.839225 patched 70063.632081
> 08 1000 master 34794.778567 patched 65998.468775
> 08 3000 master 60834.509571 patched 61165.998080
> 32 100 master 203168.264456 patched 205258.283852
> 32 300 master 199137.276025 patched 200391.633074
> 32 1000 master 177996.853496 patched 176365.732087
> 32 3000 master 149891.147442 patched 148683.269107
> 
> Something appears to have screwed up my results for 8 clients @ scale
> factor 300 on master, 

  Do you mean the reading for scale factor 1000?
  
>but overall, on both the read-only and
> read-write tests, I'm not seeing anything that resembles the big gains
> you reported.

I have not generated numbers for read-write tests; I will check that.
For read-only tests, the performance increase is minor and different from
what I saw.
A few points I can think of that might explain the difference in data:

1. In my tests I always observed the best numbers when the number of
clients/threads equals the number of cores, which in your case would be 16.
2. I think for scale factors 100 and 300 there should not be much
performance increase, as they should mostly get buffers from the freelist
whether or not the bgwriter adds to it.
3. In my tests the variable is shared_buffers; the database size is always
less than RAM (scale factor 1200, approx. db size 16~17GB, RAM 24GB), but
varying shared_buffers can lead to I/O.
4. Each run is 20 minutes; not sure if this makes any difference.
 
> Tests were run on a 16-core, 64-hwthread PPC64 machine provided to the
> PostgreSQL community courtesy of IBM.  Fedora 16, Linux kernel 3.2.6.

To understand the difference between your runs and mine, could you please
tell me about the points below:
1. How much RAM does the machine have?
2. Is the number of threads equal to the number of clients?
3. Before starting tests I have always pre-warmed the buffers (using the
pg_prewarm you wrote last year); is that the same for the above read-only tests?
4. Can you please run once more only the test where you saw the variation
(8 clients @ scale factor 1000 on master), because I have also seen that
the performance difference is very good for certain configurations
(scale factor, RAM, shared buffers).

Apart from the above, I had one more observation during my investigation
into why there is a small dip in some cases:
1. Many times it finds that a buffer on the freelist is not usable, i.e. its
refcount or usage count is not zero, and so it has to spend more time under
BufFreelistLock.
   I have not run any further experiments related to this finding, e.g.
whether it really adds any overhead.
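
For illustration, the check being described looks roughly like this (a
simplified sketch, not the actual StrategyGetBuffer() code; the freelist
helpers are made-up names):

    /* Pop buffers off the freelist until a usable one turns up; every
     * rejected buffer is extra time spent holding BufFreelistLock. */
    while (freelist_not_empty())
    {
        volatile BufferDesc *buf = pop_freelist();

        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;             /* unpinned and unused: take it */
        UnlockBufHdr(buf);          /* in use: keep looking */
    }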

Currently I am trying to find the reasons for the small dip in performance
and see if I can do something to avoid it. I will also run tests with
various configurations.

Any other suggestions?

With Regards,
Amit Kapila.


