Re: [HACKERS] Multixid hindsight design

2015-11-12 Thread Robert Haas
On Mon, Nov 9, 2015 at 9:57 PM, Craig Ringer  wrote:
> On 10 November 2015 at 02:26, Robert Haas  wrote:
>>> I'd like to see, say, python and the 'unittest' module added, and
>>> to see acceptance of tests using psycopg2 to stream and decode WAL
>>> from a slot.
>>
>> I actually kind of hate this sort of thing.
>
> For what it's worth, I don't actually like it either. However, I
> suspect that the current restricted tools are a significant impediment
> to test tooling and infrastructure improvement.

But that's exactly the argument that was made to justify adding the
TAP testing framework.  It didn't work?

>> Now I'm not dead set against changing anything at all here, but I want
>> to point out that we just added the TAP framework and already there
>> are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
>> for some tests, which a lot of people aren't going to have
>
> The alternative seems to be driving psql -qAtz as a coprocess over a
> pipeline as a poor man's database driver. I really doubt that's any
> kind of improvement when it comes to the complexity of maintaining
> tests, fixing them when they break, etc. I'd rather just write tests
> in C against libpq despite the verbosity, annoyance of maintaining
> them, etc.

I thought about an idea like this: have some sort of dummy program,
and then have a module written in Perl that provides an interface to
it, so that we don't have to depend on DBD::Pg.  I'm not at all
convinced that's a bad plan.

> Part of why I'm somewhat interested in seeing python and psycopg2
> added is that I'd personally quite like to see widely used client
> drivers covered by the buildfarm to some extent specifically to see if
> we break them. Though really, it'd make sense to add the most
> important drivers like psqlODBC and PgJDBC first if doing that and
> that'd be a whole new barrel of fun.

EnterpriseDB has something like this, and it does catch some bugs, but
it is an unbelievable amount of work to keep it working.  When
something fails, it's unlikely that any individual developer has
exactly the right configuration on their machine to reproduce the
failure.  Now maybe we could come up with something that provides
similar test coverage but is less work to maintain - I'm certainly not
going to rule out the possibility that it can be done better than
EnterpriseDB has done it.  However, I bet it will still be a lot of
work, and unlike at EnterpriseDB where we've got entire teams of people
who spend major amounts of time on this kind of stuff, PostgreSQL
relies on its committers to keep the buildfarm green.

And if I commit something and it breaks the Win32 version of some test
that is written in Python and relies on psqlODBC, there is no way in
heck I'm committing myself to reproducing that environment.  Even if
it's Linux64 and JDBC I'm not doing it.  If you want to build such a
thing and maintain it in perpetuity, that is fine, and you don't need
my permission, and I and I'm sure many others will greatly appreciate
both the effort and any bugs that get found and fixed that way.  But
when you talk about putting this kind of thing to core you are asking
the committers to be responsible for it, and I'm still not very
confident we've got the TAP tests in a state where maintaining them is
going to be a reasonable level of effort.  Those tests failed on my
machine for months, and the only reason they eventually got fixed is
because I knew enough Perl to track the problem down to some screwy
coding in one of the Perl modules MacPorts installs.  If the same
thing happens with Python, it would be extremely difficult for me to
find it, because I don't actually know Python.

>> Accepting more tools also increases the breadth of knowledge expected
>> of committers and patch authors.
>
> That's the real problem IMO. The cognitive and knowledge load.

That's definitely a big part of it, but I think you're selling short
the amount of havoc that build dependencies cause.  On a Linux system
where you can install whatever you need via 'yum install', sure, it's
easy.  It's not so easy on Windows.  It's less easy on other UNIX
systems without modern package managers.  But that's just the tip of
the iceberg.  Installing your operating system's DBD::Pg package is
going to install a version of DBD::Pg compiled against the system's
libpq, not the latest PostgreSQL libpq that you just built.  Now
perhaps, you will say, that's OK: that way we test cross-version libpq
compatibility, too!  And on a certain level I can't argue with that.
But now you could have a situation where the tests pass for one user
and fail for another user even though both are compiling from the same
PostgreSQL commit and the same DBD::Pg version, and that kind of stuff
is a lot of work to track down.  And it may not really be a bug: maybe
the person who added the test intended the test to exercise some new
libpq feature which is indeed supported by DBD::Pg, 

Re: [HACKERS] Multixid hindsight design

2015-11-09 Thread Robert Haas
On Mon, Nov 9, 2015 at 2:02 AM, Craig Ringer  wrote:
> On 25 June 2015 at 00:52, Robert Haas  wrote:
>> I agree that we can do much better at testing than we traditionally
>> have done, and I think pretty much everyone in the room for the
>> developer unconference session on testing at PGCon was also in
>> agreement with that.  I really like the idea of taking purpose-built
>> testing frameworks - like the one that Heikki created for the WAL
>> format changes - and polishing them to the point where they can go
>> into core.  That's more work, of course, but very beneficial.
>
> Something that'd really help with that, IMO, would be weakening the
> requirement that everything be C or be core Perl. Instead require that
> configure detect whether or not facilities for some tests are present,
> and have them fail with a clean warning indicating they were skipped
> for lack of pre-requisites at 'make' time.
>
> I don't see that as significantly different to having some buildfarm
> animals not bother to configure or test the PLs, SSL, etc. I
> understand why adding to the mix required for the core server build
> isn't acceptable, but hopefully separate test suites can be more
> flexible. A free-for-all of languages and tools doesn't make sense,
> but I'd like to see, say, python and the 'unittest' module added, and
> to see acceptance of tests using psycopg2 to stream and decode WAL
> from a slot.
>
> Thoughts?

I actually kind of hate this sort of thing.  It's already the case
that some of the TAP tests are skipped if you don't have the
prerequisites, and while that solves the problem of spurious test
failures, it also means that the tests which have those dependencies
are run in fewer and fewer places.
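For concreteness, the kind of skip-on-missing-prerequisite behavior being
debated here can be sketched with Python's stdlib 'unittest' module. This is
purely illustrative - the test and module names are invented, not part of any
actual PostgreSQL suite - but it shows how a dependency-gated test degrades to
a skip rather than a failure, which is exactly the property Robert worries
about (the test silently stops running anywhere the dependency is absent):

```python
# Hypothetical sketch: gate a WAL-decoding test on psycopg2 being
# installed, skipping cleanly instead of failing when it is absent.
# Names here are illustrative, not from any real PostgreSQL test suite.
import importlib.util
import unittest

HAVE_PSYCOPG2 = importlib.util.find_spec("psycopg2") is not None

class LogicalDecodingTest(unittest.TestCase):
    @unittest.skipUnless(HAVE_PSYCOPG2, "psycopg2 not installed")
    def test_stream_from_slot(self):
        import psycopg2  # only imported when the prerequisite exists
        # ... a real test would connect and consume changes from a slot ...

    def test_always_runs(self):
        # tests without extra dependencies run everywhere
        self.assertTrue(True)
```

The skip is invisible in a green buildfarm summary, which is the "run in fewer
and fewer places" failure mode described above.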

Now I'm not dead set against changing anything at all here, but I want
to point out that we just added the TAP framework and already there
are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
for some tests, which a lot of people aren't going to have, and I know
you and probably a few other people would like python, which is fair
enough, but other people are going to want other things, and pretty
soon we end up with a situation - if we're not already there -
where for a developer or packager to get a machine to the point where
it runs "all the tests" is a major undertaking.

Accepting more tools also increases the breadth of knowledge expected
of committers and patch authors.  If I commit a patch and it turns out
something breaks, I'm expected to fix that.  Now if it's in C or Perl,
I can.  If it's in Python, I can't.  If we add Python to the list of
things we test with, there will (I'm sure) be other committers who can
fix the C and Python stuff, but have no clue about the Perl.  The more
frameworks we support, the worse this gets.  And woe betide a patch
author whose feature requires adjustment of existing tests - if that
person actually notices the problem before commit, as opposed to the
buildfarm finding it, that person now needs to learn enough of every
language in which we support tests to update the existing tests and
add any new ones which are needed.  I'm starting to get the hang of
isolation tests, but I still don't *really* understand TAP tests,
and anything new we add is a whole new obstacle.

I think this is sort of an inherent problem with long-running
projects.  Everybody knows, for example, that the toolchain we're
using to build our documentation is pretty crufty.  But if we switched
to something else that used a different input format, then every
back-patchable doc fix would have to be done twice, once for the old
toolchain and once for the new, which means that at least some people
would have to understand both, for a good 5-6 years.  We'd also have
to get that new toolchain working on every platform we support, and
all of our packagers would have to switch to it, and of course we'd
want it to support all the output formats we have today.  Yet, if we
were starting over, I think there's about zero chance we'd pick this
approach.  It's just that switching now is a job and a half, and we
don't want to take that on lightly.  Similarly with testing frameworks
- except those are even worse, because nobody's saying that any of the
stuff we have now will ever go away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multixid hindsight design

2015-11-09 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 9, 2015 at 2:02 AM, Craig Ringer  wrote:
>> Something that'd really help with that, IMO, would be weakening the
>> requirement that everything be C or be core Perl. Instead require that
>> configure detect whether or not facilities for some tests are present,
>> and have them fail with a clean warning indicating they were skipped
>> for lack of pre-requisites at 'make' time.

> I actually kind of hate this sort of thing.

FWIW, I agree with Robert's position.  Expanding the infrastructure
required for tests may make life simpler for the initial author of a test,
but we pay for that over and over again when you consider the long run.

I think insisting on having e.g. DBD::Pg is simply exporting the author's
work to other people; there isn't anything that that enables that you
can't do without it.  We have added test infrastructure when there simply
wasn't any other way to test something.  Both the isolation framework and
the TAP framework represent substantial investments of that sort.  I'm
fine with both of those.  But the bar to adding new requirements ought to
be pretty darn high, and not just "I was too lazy to write it without".

regards, tom lane




Re: [HACKERS] Multixid hindsight design

2015-11-09 Thread Craig Ringer
On 10 November 2015 at 02:26, Robert Haas  wrote:
>> I'd like to see, say, python and the 'unittest' module added, and
>> to see acceptance of tests using psycopg2 to stream and decode WAL
>> from a slot.
>
> I actually kind of hate this sort of thing.

For what it's worth, I don't actually like it either. However, I
suspect that the current restricted tools are a significant impediment
to test tooling and infrastructure improvement.

> Now I'm not dead set against changing anything at all here, but I want
> to point out that we just added the TAP framework and already there
> are proposals (like Kevin's snapshot too old patch) to require DBD::Pg
> for some tests, which a lot of people aren't going to have

The alternative seems to be driving psql -qAtz as a coprocess over a
pipeline as a poor man's database driver. I really doubt that's any
kind of improvement when it comes to the complexity of maintaining
tests, fixing them when they break, etc. I'd rather just write tests
in C against libpq despite the verbosity, annoyance of maintaining
them, etc.
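To illustrate why the coprocess approach is unappealing, here is a minimal
sketch of driving a line-oriented child process over pipes. To stay
self-contained it drives `cat` as a stand-in; a real harness would launch
something like `psql -qAt` and write SQL instead. Note the fragile
assumptions baked in (one line out per line in, careful flushing), which is
the maintenance burden Craig is pointing at:

```python
# Sketch of a "poor man's database driver": a line-oriented coprocess
# driven over pipes. Uses `cat` as a stand-in so the example is
# runnable without a server; substitute e.g. ["psql", "-qAt", dbname]
# in a real harness.
import subprocess

class Coprocess:
    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered text mode
        )

    def roundtrip(self, line):
        """Send one line, read one line back (assumes 1:1 responses)."""
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

co = Coprocess(["cat"])
echoed = co.roundtrip("SELECT 1;")  # cat simply echoes the input back
co.close()
```

Multi-row results, errors, and notices all break the one-line-per-query
assumption, which is where such wrappers start accreting special cases.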

Part of why I'm somewhat interested in seeing python and psycopg2
added is that I'd personally quite like to see widely used client
drivers covered by the buildfarm to some extent specifically to see if
we break them. Though really, it'd make sense to add the most
important drivers like psqlODBC and PgJDBC first if doing that and
that'd be a whole new barrel of fun.

> and I know
> you and probably a few other people would like python, which is fair
> enough, but other people are going to want other things, and pretty
> soon we end up with a situation - if we're not already there -
> where for a developer or packager to get a machine to the point where
> it runs "all the tests" is a major undertaking.

That's the bit I find odd, I guess. It's *trivial* to add DBD::Pg on
any remotely sane host. Similarly Python and psycopg2. One package
each, or a very simple compile. It's nothing in comparison to setting
up to building our documentation... or getting pretty much any of the
basic libraries in place on Windows.

(However, the streaming replication support for psycopg2 hasn't landed
in mainline yet, so you can't just grab it from PGDG repos).

Adding more tools does not make the existing ones easier, of course,
so that's not really an argument in favour. I just suspect the cost of
adding them is being somewhat overblown. That said ... I don't
maintain a 1990s-era MIPS box, an old ARM 6, or whatever.

It's a pity in a way that the Java toolchain and build process is so
heavy, because otherwise it'd make sense to just use Java and PgJDBC.
PgJDBC development happens closer to core Pg than most drivers, within
the main Pg community. It's also crucial for PostgreSQL's success,
with huge adoption and a lot of people relying on it. JUnit is one of
the least awful test frameworks I've worked with. Java is widely
understood and relatively easy to follow even if you don't know it
well. Unfortunately I see around zero chance of getting a JVM, maven
or ant, and a bajillion libraries onto buildfarm members.

> Accepting more tools also increases the breadth of knowledge expected
> of committers and patch authors.

That's the real problem IMO. The cognitive and knowledge load.

It's exactly the same reason why I just explained to someone else (the
not-UTF32 network run by the secret gov't conspiracy/aliens dude, as
it turns out) why the codebase requires English identifiers, variable
names, comments, etc. While I'd argue that the cost of adding Python
localized to some tests is lower than the cost of adding (say) code in
Russian to a contrib, it's not zero.

So I guess I've already argued against myself in another form.

Still, the alternative seems to be writing new bespoke tools on top of
libpq. Need something to analyse and decode a WAL stream? Write it in
C on top of libpq, add it to the test infrastructure. Probably write
new tests as plugins in C, too, or use an embedded Perl runtime, or
XS. Or concoct yet another custom language / test description, like
isolationtester, for the purpose. Those tools aren't much easier to
get rid of, once added, than new languages etc are.

Eventually it'd be good to be able to actually test things like
streaming replication in an automated manner. We're going to hit the
limitations of 'do some stuff and diff the output files' pretty
quickly there. Adding a new scripting language isn't exactly going to
fix that though.

> Similarly with testing frameworks
> - except those are even worse, because nobody's saying that any of the
> stuff we have now will ever go away.

Yeah. I guess that's the issue, isn't it. When something gets added
we're generally stuck with it forever. We still have PL/Tcl!

Even if it were desirable (and ignoring the horrific backpatching pain
that'd result), nobody's going to leap up and volunteer to rewrite all
the Perl like the Windows build system so we could remove Perl. It's
there 

Fwd: [HACKERS] Multixid hindsight design

2015-11-08 Thread Craig Ringer
On 25 June 2015 at 00:52, Robert Haas  wrote:

> I agree that we can do much better at testing than we traditionally
> have done, and I think pretty much everyone in the room for the
> developer unconference session on testing at PGCon was also in
> agreement with that.  I really like the idea of taking purpose-built
> testing frameworks - like the one that Heikki created for the WAL
> format changes - and polishing them to the point where they can go
> into core.  That's more work, of course, but very beneficial.

Something that'd really help with that, IMO, would be weakening the
requirement that everything be C or be core Perl. Instead require that
configure detect whether or not facilities for some tests are present,
and have them fail with a clean warning indicating they were skipped
for lack of pre-requisites at 'make' time.

I don't see that as significantly different to having some buildfarm
animals not bother to configure or test the PLs, SSL, etc. I
understand why adding to the mix required for the core server build
isn't acceptable, but hopefully separate test suites can be more
flexible. A free-for-all of languages and tools doesn't make sense,
but I'd like to see, say, python and the 'unittest' module added, and
to see acceptance of tests using psycopg2 to stream and decode WAL
from a slot.

Thoughts?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Multixid hindsight design

2015-06-24 Thread Robert Haas
On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas robertmh...@gmail.com wrote:
 It would be a great deal nicer if we didn't have to keep ANY of the
 transactional data for a tuple around once it's all-visible.  Heikki
 defined ephemeral as "only needed when xmin or xmax is in-progress",
 but if we extended that definition slightly to "only needed when xmin
 or xmax is in-progress or committed but not all-visible" then the
 amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
 t_hoff).

OK, I was wrong here: if you only have that stuff, you can't
distinguish between a tuple that is visible to everyone and a tuple
that is visible to no one.  I think the minimal amount of data we need
in order to distinguish visibility once no relevant transactions are
in progress is one XID: either XMIN, if the tuple was never updated at
all or only by the inserting transaction or one of its subxacts; or
XMAX, if the inserting transaction committed.  The other visibility
information -- including (1) the other of XMIN and XMAX, (2) CMIN and
CMAX, and (3) the CTID -- are only interesting until the transactions
involved are no longer running and, if they committed, visible to all
running transactions.
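The one-XID argument above can be checked with a toy model (this is not
PostgreSQL's actual visibility code, just an illustration of the reasoning):
once no relevant transactions are in progress, keeping a single XID - XMIN if
the tuple was never deleted or the insert's outcome is still undecided, XMAX
otherwise - plus the commit log is enough to classify a tuple as
visible-to-all or visible-to-none:

```python
# Toy model of the argument above (not real PostgreSQL logic): with no
# relevant transactions still running, one kept XID plus commit status
# reproduces the full-information visibility answer.
def kept_xid(xmin, xmax, insert_known_committed):
    """The single XID worth keeping, per the scheme sketched above."""
    if xmax is None or not insert_known_committed:
        return xmin   # never deleted, or the insert's outcome decides
    return xmax       # insert committed; the delete's outcome decides

def visible_to_all(xmin, xmax, committed):
    """Ground truth computed from full information, for comparison."""
    if xmin not in committed:
        return False
    return xmax is None or xmax not in committed

committed = {100, 102}
cases = [
    (100, None),  # committed insert, never deleted -> visible
    (101, None),  # aborted insert -> invisible
    (100, 102),   # committed insert and delete -> invisible
    (100, 103),   # committed insert, aborted delete -> visible
]
for xmin, xmax in cases:
    x = kept_xid(xmin, xmax, xmin in committed)
    if x == xmin:
        derived = xmin in committed and xmax is None
    else:
        derived = x not in committed
    assert derived == visible_to_all(xmin, xmax, committed)
```

The model ignores in-progress transactions entirely, because the premise is
that the classification only has to hold once they are all resolved.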

Heikki's proposal is basically to merge the 4-byte CID field and the
first 4 bytes of the CTID that currently store the block number into
one 8-byte field that can store a pointer into this new TED structure.
And after mulling it over, that sounds pretty good to me.  It's true
(as has been pointed out by several people) that the TED will need to
be persistent because of prepared transactions.  But it would still be
a big improvement over the status quo, because:

(1) We would no longer need to freeze MultiXacts.  TED wouldn't need
to be frozen either.  You'd just truncate it whenever RecentGlobalXmin
advances.

(2) If the TED becomes horribly corrupted, you can recover by
committing or aborting any prepared transactions, shutting the system
down, and truncating it, with no loss of data integrity.  Nothing in
the TED is required to determine whether tuples are visible to an
unrelated transaction - you only need it (a) to determine whether
tuples are visible to a particular command within a transaction that
has inserted, updated, or deleted the tuple and (b) determine whether
tuples are locked.

(3) As a bonus, we'd eliminate combo CIDs, because the TED could have
space to separately store CMIN and CMAX.  Combo CIDs required special
handling for logical decoding, and they are one of the nastier
barriers to making parallelism support writes (because they are stored
in backend-local memory of unbounded size and therefore can't easily
be shared with workers), so it wouldn't be very sad if they went away.
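A rough sketch of why separate CMIN/CMAX storage in the TED makes combo CIDs
unnecessary (all names and structure here are invented for illustration; the
actual TED design was never implemented): if the ephemeral side-structure has
room for both command IDs per tuple, a tuple inserted and then deleted in the
same transaction no longer needs two values squeezed into one header field:

```python
# Hypothetical model of the TED idea: an ephemeral structure keyed by a
# per-tuple pointer, with separate slots for CMIN, CMAX, and lockers.
# Invented for illustration; not an actual PostgreSQL data structure.
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class TedEntry:
    cmin: Optional[int] = None              # command that inserted the tuple
    cmax: Optional[int] = None              # command that deleted/updated it
    lockers: Set[int] = field(default_factory=set)  # locking XIDs

class Ted:
    def __init__(self):
        self.entries = {}                   # tuple pointer -> TedEntry

    def on_insert(self, tid, cid):
        self.entries.setdefault(tid, TedEntry()).cmin = cid

    def on_delete(self, tid, cid):
        self.entries.setdefault(tid, TedEntry()).cmax = cid

    def truncate(self):
        # analogous to zapping the structure once nothing can need it
        self.entries.clear()

ted = Ted()
ted.on_insert((1, 7), cid=3)   # inserted by command 3
ted.on_delete((1, 7), cid=5)   # deleted by command 5, same transaction
entry = ted.entries[(1, 7)]    # both command IDs survive side by side
```

Because the structure is shared rather than backend-local, nothing like the
unbounded per-backend combo-CID map would need to be shipped to parallel
workers.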

I'm not quite sure how to decide whether something like this is worth (a)
the work and (b) the risk of creating new bugs, but the more I think
about it, the more the principle of the thing seems sound to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Multixid hindsight design

2015-06-24 Thread Simon Riggs
On 24 June 2015 at 14:57, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Jun 5, 2015 at 10:46 AM, Robert Haas robertmh...@gmail.com
 wrote:
  It would be a great deal nicer if we didn't have to keep ANY of the
  transactional data for a tuple around once it's all-visible.  Heikki
  defined ephemeral as "only needed when xmin or xmax is in-progress",
  but if we extended that definition slightly to "only needed when xmin
  or xmax is in-progress or committed but not all-visible" then the
  amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
  t_hoff).

 OK, I was wrong here: if you only have that stuff, you can't
 distinguish between a tuple that is visible to everyone and a tuple
 that is visible to no one.  I think the minimal amount of data we need
 in order to distinguish visibility once no relevant transactions are
 in progress is one XID: either XMIN, if the tuple was never updated at
 all or only by the inserting transaction or one of its subxacts; or
 XMAX, if the inserting transaction committed.  The other visibility
 information -- including (1) the other of XMIN and XMAX, (2) CMIN and
 CMAX, and (3) the CTID -- are only interesting until the transactions
 involved are no longer running and, if they committed, visible to all
 running transactions.

 Heikki's proposal is basically to merge the 4-byte CID field and the
 first 4 bytes of the CTID that currently store the block number into
 one 8-byte field that can store a pointer into this new TED structure.
 And after mulling it over, that sounds pretty good to me.  It's true
 (as has been pointed out by several people) that the TED will need to
 be persistent because of prepared transactions.  But it would still be
 a big improvement over the status quo, because:

 (1) We would no longer need to freeze MultiXacts.  TED wouldn't need
 to be frozen either.  You'd just truncate it whenever RecentGlobalXmin
 advances.

 (2) If the TED becomes horribly corrupted, you can recover by
 committing or aborting any prepared transactions, shutting the system
 down, and truncating it, with no loss of data integrity.  Nothing in
 the TED is required to determine whether tuples are visible to an
 unrelated transaction - you only need it (a) to determine whether
 tuples are visible to a particular command within a transaction that
 has inserted, updated, or deleted the tuple and (b) determine whether
 tuples are locked.

 (3) As a bonus, we'd eliminate combo CIDs, because the TED could have
 space to separately store CMIN and CMAX.  Combo CIDs required special
 handling for logical decoding, and they are one of the nastier
 barriers to making parallelism support writes (because they are stored
 in backend-local memory of unbounded size and therefore can't easily
 be shared with workers), so it wouldn't be very sad if they went away.

 I'm not quite sure how to decide whether something like this is worth (a)
 the work and (b) the risk of creating new bugs, but the more I think
 about it, the more the principle of the thing seems sound to me.


Splitting multixacts into persistent (xmax) and ephemeral (TED) is
something I already proposed so I support the concept; TED is a much better
suggestion, so I support TED.

Your addition of removing combocids is good also, since everything is
public.

I think we need to see a detailed design and we also need to understand the
size of this new beast. I'm worried it might become very big, very quickly
causing problems for us in other ways. We would need to be certain that
truncation can actually occur reasonably frequently and that there are no
edge cases that cause it to bloat.

Though TED sounds nice, the way to avoid another round of on-disk bugs is
by using a new kind of testing, not simply by moving the bits around.

It might be argued that we are increasing the diagnostic/forensic
capabilities by making CIDs more public. We can use that...

The good thing I see from TED is it allows us to test the on-disk outcome
of concurrent activity. Currently we have isolationtester, but that is not
married in any way to the on-disk state allowing us the situation where
isolationtester can pass yet we have corrupted on-disk state. We should
specify the on-disk tuple representation as a state machine and work out
how to recheck the new on-disk state matches the state transition that we
performed.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Multixid hindsight design

2015-06-24 Thread Robert Haas
On Wed, Jun 24, 2015 at 11:30 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Though TED sounds nice, the way to avoid another round of on-disk bugs is by
 using a new kind of testing, not simply by moving the bits around.

I agree that we can do much better at testing than we traditionally
have done, and I think pretty much everyone in the room for the
developer unconference session on testing at PGCon was also in
agreement with that.  I really like the idea of taking purpose-built
testing frameworks - like the one that Heikki created for the WAL
format changes - and polishing them to the point where they can go
into core.  That's more work, of course, but very beneficial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Multixid hindsight design

2015-06-24 Thread Simon Riggs
On 24 June 2015 at 16:30, Simon Riggs si...@2ndquadrant.com wrote:


 Though TED sounds nice, the way to avoid another round of on-disk bugs is
 by using a new kind of testing, not simply by moving the bits around.

 It might be argued that we are increasing the diagnostic/forensic
 capabilities by making CIDs more public. We can use that...

 The good thing I see from TED is it allows us to test the on-disk outcome
 of concurrent activity. Currently we have isolationtester, but that is not
 married in any way to the on-disk state allowing us the situation where
 isolationtester can pass yet we have corrupted on-disk state. We should
 specify the on-disk tuple representation as a state machine and work out
 how to recheck the new on-disk state matches the state transition that we
 performed.


To put some more flesh on this idea...

What I'm suggesting is moving from a 2-session isolationtester to a
3-session isolationtester

1. Session 1
2. Session 2
3. After-action confirmation that the planned state change exists correctly
on disk, rather than simply having the correct behaviour

The absence of the third step in our current testing is what has led us to
various bugs in the past (IMHO)

A fourth step would be to define the isolationtests in such a way that we
can run them as burn-in tests for billions of executions.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Multixid hindsight design

2015-06-05 Thread Simon Riggs
On 1 June 2015 at 20:53, Thomas Munro thomas.mu...@enterprisedb.com wrote:

 On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas hlinn...@iki.fi
 wrote:
  The beauty of this would be that the TED entries can be zapped at
 restart,
  just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
  WAL-logged, and we are free to change its on-disk layout even in a minor
  release.

 What about prepared transactions?  They can lock rows FOR SHARE that
 survive server restarts.


Interesting comment. I'm not aware that we do.

If we do support row locking that survives server restart, how did it work
before 9.3?

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Multixid hindsight design

2015-06-05 Thread Andres Freund
On 2015-06-05 10:45:09 +0100, Simon Riggs wrote:
 On 1 June 2015 at 20:53, Thomas Munro thomas.mu...@enterprisedb.com wrote:
 
  On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas hlinn...@iki.fi
  wrote:
   The beauty of this would be that the TED entries can be zapped at
  restart,
   just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
   WAL-logged, and we are free to change its on-disk layout even in a minor
   release.
 
  What about prepared transactions?  They can lock rows FOR SHARE that
  survive server restarts.
 
 
 Interesting comment. I'm not aware that we do.
 
 If we do support row locking that survives server restart, how did it work
 before 9.3?

Multixacts were persistent before 9.3 as well. A good number of the bugs
existed then as well, but their effect was much more limited. The
difference is that now multixacts don't just have to survive till the
last locker isn't running anymore (which was determined by a horizon),
but that they have to live till they're vacuumed away, since xmax might
be stored in the multixact.




Re: [HACKERS] Multixid hindsight design

2015-06-05 Thread Simon Riggs
On 11 May 2015 at 22:20, Heikki Linnakangas hlinn...@iki.fi wrote:

 I'd like to discuss how we should've implemented the infamous 9.3
 multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight is
 always 20/20 - I'll readily admit that I didn't understand the problems
 until well after the release - so this isn't meant to bash what's been
 done. Rather, let's think of the future.

 The main problem with the infamous multixid changes was that it made
 pg_multixact a permanent, critical, piece of data. Without it, you cannot
 decipher whether some rows have been deleted or not. The 9.3 changes
 uncovered pre-existing issues with vacuuming and wraparound, but the fact
 that multixids are now critical turned those otherwise relatively
 harmless bugs into data loss.

 We have pg_clog, which is a similar critical data structure. That's a pain
 too - you need VACUUM and you can't easily move tables from one cluster to
 another for example - but we've learned to live with it. But we certainly
 don't need any more such data structures.

 So the lesson here is that having a permanent pg_multixact is not nice,
 and we should get rid of it. Here's how to do that:


I think we should think back to exactly what we are trying to store, why
and for how long. A clear definition of what we are trying to achieve is
essential to solving the problem.

In my understanding we need to store
* at most one xid - the Updating Xid
* potentially many Locking Xids

The current design has two SLRUs and puts all of those xids in the Members
SLRU, causing it to need to be persistent.

The problems come from having significant numbers of locking xids. My
understanding is that any change in the number of lockers requires the full
array to be rewritten. So with N lockers we end up writing 2N-1 arrays, each
with an average of N/2 members, i.e. roughly N^2 entries in total, an O(N^2)
algorithm, which makes it a bad thing to persist. Assuming that design
continues mostly unchanged in its core points...
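For what it's worth, the quadratic blow-up is easy to see with a back-of-the-envelope sketch (plain Python, not PostgreSQL code; the whole-array-rewrite behaviour is the assumption stated above):

```python
# Count member entries written as N lockers arrive one at a time,
# assuming every change to the locker set rewrites the whole array.
def entries_written(n_lockers):
    total = 0
    for size in range(1, n_lockers + 1):
        total += size  # the k-th arrival rewrites an array of k members
    return total  # = N*(N+1)/2, i.e. O(N^2)
```

Ten lockers already write 55 entries and a hundred write 5,050, which is why persisting this structure scales so badly.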

An alternate proposal:

1. Store only the Locking xids in the Members SLRU
2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the
Locking xids in the Members SLRU.

This means the Offsets SLRU will be around twice the size it was before BUT
since we reduce the size of each Members array by one, there is a balanced
saving there, so this change is disk-space-neutral.

That way if we need to make Offsets SLRU persistent it won't bloat.
We then leave the Members SLRU as non-persistent, just as it was before 9.3.
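The disk-space-neutral claim can be sanity-checked with rough, assumed field widths (4-byte xids, 5-byte member entries including status flags; these numbers are illustrative, not the actual on-disk layout):

```python
XID = 4      # assumed bytes per transaction id
OFFSET = 4   # assumed bytes per Members offset
MEMBER = 5   # assumed bytes per member entry (xid + flag bits)

def old_bytes(n_lockers):
    # Offsets entry holds just an offset; the updater lives among the members.
    return OFFSET + (n_lockers + 1) * MEMBER

def new_bytes(n_lockers):
    # Offsets entry grows to hold the updating xid; members hold lockers only.
    return (XID + OFFSET) + n_lockers * MEMBER
```

The Offsets entry doubles while the Members array shrinks by one entry, so under these assumptions the totals per multixact differ by only a byte, regardless of the number of lockers.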

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Multixid hindsight design

2015-06-05 Thread Simon Riggs
On 5 June 2015 at 11:02, Andres Freund and...@anarazel.de wrote:

 On 2015-06-05 10:45:09 +0100, Simon Riggs wrote:
  On 1 June 2015 at 20:53, Thomas Munro thomas.mu...@enterprisedb.com
 wrote:
 
   On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas hlinn...@iki.fi
   wrote:
The beauty of this would be that the TED entries can be zapped at
   restart,
just like pg_subtrans, and pg_multixact before 9.3. It doesn't need
 to be
WAL-logged, and we are free to change its on-disk layout even in a
 minor
release.
  
   What about prepared transactions?  They can lock rows FOR SHARE that
   survive server restarts.
  
 
  Interesting comment. I'm not aware that we do.
 
  If we do support row locking that survives server restart, how did it
 work
  before 9.3?

 Multixacts were persistent before 9.3 as well. A good number of the bugs
 existed then as well, but their effect was much more limited. The
 difference is that now multixacts don't just have to survive till the
 last locker isn't running anymore (which was determined by a horizon),
 but that they have to live till they're vacuumed away, since xmax might
 be stored in the multixact.


Phew! Had me worried for a minute.

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Multixid hindsight design

2015-06-05 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 11 May 2015 at 22:20, Heikki Linnakangas hlinn...@iki.fi wrote:
 So the lesson here is that having a permanent pg_multixact is not nice,
 and we should get rid of it. Here's how to do that:

 An alternate proposal:

 1. Store only the Locking xids in the Members SLRU
 2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the
 Locking xids in the Members SLRU.

 This means the Offsets SLRU will be around twice the size it was before BUT
 since we reduce the size of each Members array by one, there is a balanced
 saving there, so this change is disk-space-neutral.

 That way if we need to make Offsets SLRU persistent it won't bloat.
 We then leave the Members SLRU as non-persistent, just as it was before 9.3

I don't think you can do that, because it supposes that locking XIDs need
not be remembered across a crash.  Don't prepared transactions break that
assumption?

regards, tom lane




Re: [HACKERS] Multixid hindsight design

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 6:17 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I think we should think back to exactly what we are trying to store, why and
 for how long. A clear definition of what we are trying to achieve is
 essential to solving the problem.

 In my understanding we need to store
 * at most one xid - the Updating Xid
 * potentially many Locking Xids

 The current design has two SLRUs and puts all of those xids in the Members
 SLRU, causing it to need to be persistent.

 The problems come from having significant numbers of locking xids. My
 understanding is that any change in the number of lockers requires the full
 array to be rewritten. So with N lockers we end up writing 2N-1 arrays, each
 with an average of N/2 members, i.e. roughly N^2 entries in total, an O(N^2)
 algorithm, which makes it a bad thing to persist. Assuming that design
 continues mostly unchanged in its core points...

 An alternate proposal:

 1. Store only the Locking xids in the Members SLRU
 2. In the Offsets SLRU store: 1) the Updating Xid and 2) the offset to the
 Locking xids in the Members SLRU.

 This means the Offsets SLRU will be around twice the size it was before BUT
 since we reduce the size of each Members array by one, there is a balanced
 saving there, so this change is disk-space-neutral.

 That way if we need to make Offsets SLRU persistent it won't bloat.
 We then leave the Members SLRU as non-persistent, just as it was before 9.3

Hmm, this is a neat idea.  It would have been easier to implement if
we'd thought of it before we released 9.3, though.  At this point, I
guess we'd have to either have a pg_upgrade compatibility break, or
teach pg_upgrade to rejigger the old files into the new file format,
or some other fix that's not immediately apparent to me.  And it also
sounds like a fair amount of work.  But it might be worth it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Multixid hindsight design

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 10:27 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 That way if we need to make Offsets SLRU persistent it won't bloat.
 We then leave the Members SLRU as non-persistent, just as it was before 9.3

 I don't think you can do that, because it supposes that locking XIDs need
 not be remembered across a crash.  Don't prepared transactions break that
 assumption?

Well, that issue existed before 9.3, too.  It's possible our old
releases weren't entirely correct either, but the big change in 9.3 is
that we have to keep MultiXacts around until they are frozen, rather
than just until their member transactions are no longer running.  If I
understand correctly, Simon's proposal would mean that
pg_multixact/offsets would still need to survive until freezing, but
pg_multixact/members would only need to survive until the member
transactions were no longer running.  That might span a crash or
restart, in the case of prepared transactions, but we could clean up
the member offsets when the prepared transactions were committed,
rather than having to scan every table in the cluster first.  That
only eliminates half the need for multixact vacuuming, but it's
something.

It would be a great deal nicer if we didn't have to keep ANY of the
transactional data for a tuple around once it's all-visible.  Heikki
defined "ephemeral" as "only needed when xmin or xmax is in-progress",
but if we extended that definition slightly to "only needed when xmin
or xmax is in-progress or committed but not all-visible" then the
amount of non-ephemeral data in the tuple header is 5 bytes (infomasks +
t_hoff).
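The 5-byte figure falls out of the usual HeapTupleHeaderData field widths; as a sketch (treating the widths as assumptions about the layout):

```python
# Tuple-header fields that must outlive all-visibility, using the
# standard HeapTupleHeaderData field widths.
t_infomask2 = 2  # attribute count plus assorted flag bits
t_infomask = 2   # visibility/locking flag bits
t_hoff = 1       # offset to the start of user data
non_ephemeral = t_infomask2 + t_infomask + t_hoff
# everything else (xmin/xmax, CID, CTID) could in principle be reclaimed
```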

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Multixid hindsight design

2015-06-01 Thread Thomas Munro
On Tue, May 12, 2015 at 9:20 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 The beauty of this would be that the TED entries can be zapped at restart,
 just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
 WAL-logged, and we are free to change its on-disk layout even in a minor
 release.

What about prepared transactions?  They can lock rows FOR SHARE that
survive server restarts.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Multixid hindsight design

2015-06-01 Thread Greg Stark
On Mon, Jun 1, 2015 at 8:53 PM, Thomas Munro
thomas.mu...@enterprisedb.com wrote:
 What about prepared transactions?  They can lock rows FOR SHARE that
 survive server restarts.

And they can make update chains that are still uncommitted after a restart.

I think we should think separately about what information we want to
store in the tuple ideally and only then worry about how to encode it
concisely as an optimization exercise. If you just grow every tuple by
64-bits would this scheme be trivial? Perhaps it would be worth
implementing it that way first and then worrying about how to recover
that space later?


-- 
greg




Re: [HACKERS] Multixid hindsight design

2015-05-12 Thread Heikki Linnakangas

On 05/12/2015 01:51 AM, Tom Lane wrote:

 Heikki Linnakangas hlinn...@iki.fi writes:
  So the lesson here is that having a permanent pg_multixact is not nice,
  and we should get rid of it. Here's how to do that:

 That would be cool, but ...

  Looking at the tuple header, the CID and CTID fields are only needed
  when either xmin or xmax is running. Almost: in a HOT-updated tuple,
  CTID is required even after xmax has committed, but since it's a HOT
  update, the new tuple is always on the same page so you only need the
  offsetnumber part.

 I think this is totally wrong.  HOT update or not, you need the forward
 link represented by ctid not just until xmin/xmax commit, but until
 they're in the past according to all active snapshots.  That's so that
 other transactions can chain forward to the current version of a tuple
 which they found according to their snapshots.

 It might be you can salvage the idea anyway, since it's still true that
 the forward links wouldn't be needed after a crash and restart.  But the
 argument as stated is wrong.


Ah yes, I stated that wrong. What I meant was that they are not needed 
after xmin and xmax are older than global xmin.



 (There's also the question of forensic requirements, although I'm aware
 that it's barely worth bringing that up since nobody else here seems to
 put much weight on that.)


I do care about that. In this scheme, you would always have the 
updater/deleter XMAX on the tuple itself, which IMO is more useful for 
forensic purposes than a multixid. You lose the CID and CTID in the 
tuple (for tuples that are updated and locked at the same time), but if 
you keep the TED around longer, you have all the information still 
there. On the whole, I don't think this is much worse than the current 
situation.


- Heikki




Re: [HACKERS] Multixid hindsight design

2015-05-12 Thread Robert Haas
On Mon, May 11, 2015 at 5:20 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 The main problem with the infamous multixid changes was that it made
 pg_multixact a permanent, critical, piece of data. Without it, you cannot
 decipher whether some rows have been deleted or not. The 9.3 changes
 uncovered pre-existing issues with vacuuming and wraparound, but the fact
 that multixids are now critical turned those otherwise relatively
 harmless bugs into data loss.

Agreed.

 We have pg_clog, which is a similar critical data structure. That's a pain
 too - you need VACUUM and you can't easily move tables from one cluster to
 another for example - but we've learned to live with it. But we certainly
 don't need any more such data structures.

Yes.

 So the lesson here is that having a permanent pg_multixact is not nice, and
 we should get rid of it. Here's how to do that:

 Looking at the tuple header, the CID and CTID fields are only needed when
 either xmin or xmax is running. Almost: in a HOT-updated tuple, CTID is
 required even after xmax has committed, but since it's a HOT update, the new
 tuple is always on the same page so you only need the offsetnumber part.
 That leaves us with 8 bytes that are always available for storing
 ephemeral information. By ephemeral, I mean that it is only needed when
 xmin or xmax is in-progress. After that, e.g. after a shutdown, it's never
 looked at.

 Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed by
 a 64-bit pointer, which means that it never wraps around. That 64-bit
 pointer is stored in the tuple header, in those 8 ephemeral bytes currently
 used for CID and CTID. Whenever a tuple is deleted/updated and locked at the
 same time, a TED entry is created for it, in the new SLRU, and the pointer
 to the entry is put on the tuple. In the TED entry, we can use as many bytes
 as we need to store the ephemeral data. It would include the CID (or
 possibly both CMIN and CMAX separately, now that we have the space), CTID,
 and the locking XIDs. The list of locking XIDs could be stored there
 directly, replacing multixids completely, or we could store a multixid
 there, and use the current pg_multixact system to decode them. Or we could
 store the multixact offset in the TED, replacing the multixact offset SLRU,
 but keep the multixact member SLRU as is.

 The XMAX stored on the tuple header would always be a real transaction ID,
 not a multixid. Hence locked-only tuples don't need to be frozen afterwards.

 The beauty of this would be that the TED entries can be zapped at restart,
 just like pg_subtrans, and pg_multixact before 9.3. It doesn't need to be
 WAL-logged, and we are free to change its on-disk layout even in a minor
 release.

 Further optimizations are possible. If the TED entry fits in 8 bytes, it can
 be stored directly in the tuple header. Like today, if a tuple is locked but
 not deleted/updated, you only need to store the locker XID, and you can
 store the locking XID directly on the tuple. Or if it's deleted and locked,
 CTID is not needed, only CID and locker XID, so you can store those directly
 on the tuple. Plus some spare bits to indicate what is stored. And if the
 XMIN is older than global-xmin, you could also steal the XMIN field for
 storing TED data, making it possible to store 12 bytes directly in the tuple
 header. Plus some spare bits again to indicate that you've done that.

 Now, given where we are, how do we get there? Upgrade is a pain, because
 even if we no longer generate any new multixids, we'll have to be able to
 decode them after pg_upgrade. Perhaps condense pg_multixact into a simpler
 pg_clog-style bitmap at pg_upgrade, to make it small and simple to read, but
 it would nevertheless be a fair amount of code just to deal with pg_upgraded
 databases.

 I think this is worth doing, even after we've fixed all the acute multixid
 bugs, because this would be more robust in the long run. It would also
 remove the need to do anti-wraparound multixid vacuums, and the newly-added
 tuning knobs related to that.

One danger is that in rearranging all of this stuff we may introduce
lots of new bugs.  I do agree that making multixacts need to survive a
server crash was not a good idea.  We liked freezing xmin so much, we
decided to freeze xmax, too?  Uggh.  As painful as that's been,
though, we're 18 months into it at this point.  If we do another
reorganization, are we going to end up back at month 0, where pretty much
everybody had corruption all the time rather than only some people on
some workloads?  Maybe not, but it's certainly something to worry
about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Multixid hindsight design

2015-05-11 Thread Heikki Linnakangas
I'd like to discuss how we should've implemented the infamous 9.3 
multixid/row-locking stuff, and perhaps still should in 9.6. Hindsight 
is always 20/20 - I'll readily admit that I didn't understand the 
problems until well after the release - so this isn't meant to bash 
what's been done. Rather, let's think of the future.


The main problem with the infamous multixid changes was that it made 
pg_multixact a permanent, critical, piece of data. Without it, you 
cannot decipher whether some rows have been deleted or not. The 9.3 
changes uncovered pre-existing issues with vacuuming and wraparound, but 
the fact that multixids are now critical turned those otherwise 
relatively harmless bugs into data loss.


We have pg_clog, which is a similar critical data structure. That's a 
pain too - you need VACUUM and you can't easily move tables from one 
cluster to another for example - but we've learned to live with it. But 
we certainly don't need any more such data structures.


So the lesson here is that having a permanent pg_multixact is not nice, 
and we should get rid of it. Here's how to do that:



Looking at the tuple header, the CID and CTID fields are only needed 
when either xmin or xmax is running. Almost: in a HOT-updated tuple, 
CTID is required even after xmax has committed, but since it's a HOT 
update, the new tuple is always on the same page so you only need the 
offsetnumber part. That leaves us with 8 bytes that are always available 
for storing "ephemeral" information. By "ephemeral", I mean that it is 
only needed when xmin or xmax is in-progress. After that, e.g. after a 
shutdown, it's never looked at.


Let's add a new SLRU, called Tuple Ephemeral Data (TED). It is addressed 
by a 64-bit pointer, which means that it never wraps around. That 64-bit 
pointer is stored in the tuple header, in those 8 ephemeral bytes 
currently used for CID and CTID. Whenever a tuple is deleted/updated and 
locked at the same time, a TED entry is created for it, in the new SLRU, 
and the pointer to the entry is put on the tuple. In the TED entry, we 
can use as many bytes as we need to store the ephemeral data. It would 
include the CID (or possibly both CMIN and CMAX separately, now that we 
have the space), CTID, and the locking XIDs. The list of locking XIDs 
could be stored there directly, replacing multixids completely, or we 
could store a multixid there, and use the current pg_multixact system to 
decode them. Or we could store the multixact offset in the TED, 
replacing the multixact offset SLRU, but keep the multixact member SLRU 
as is.
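A quick sanity check on "never wraps around": even at an absurdly high, assumed allocation rate, a 64-bit counter is effectively inexhaustible.

```python
# How long until a 64-bit TED pointer wraps, at an assumed (and very
# generous) allocation rate of 100 million entries per second.
rate = 100_000_000
seconds_to_wrap = 2**64 / rate
years_to_wrap = seconds_to_wrap / (365 * 24 * 3600)
# on the order of several thousand years, so no wraparound vacuuming
# would ever be needed for the TED
```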


The XMAX stored on the tuple header would always be a real transaction 
ID, not a multixid. Hence locked-only tuples don't need to be frozen 
afterwards.


The beauty of this would be that the TED entries can be zapped at 
restart, just like pg_subtrans, and pg_multixact before 9.3. It doesn't 
need to be WAL-logged, and we are free to change its on-disk layout even 
in a minor release.


Further optimizations are possible. If the TED entry fits in 8 bytes, it 
can be stored directly in the tuple header. Like today, if a tuple is 
locked but not deleted/updated, you only need to store the locker XID, 
and you can store the locking XID directly on the tuple. Or if it's 
deleted and locked, CTID is not needed, only CID and locker XID, so you 
can store those directly on the tuple. Plus some spare bits to indicate 
what is stored. And if the XMIN is older than global-xmin, you could 
also steal the XMIN field for storing TED data, making it possible to 
store 12 bytes directly in the tuple header. Plus some spare bits again 
to indicate that you've done that.
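The case analysis above could be sketched like this (field widths and case names are mine, purely illustrative, not PostgreSQL code):

```python
# Which of the cases above fit in the tuple header itself, given the
# 8 ephemeral bytes (CID + CTID), plus 4 more if XMIN can be stolen.
XID, CID, CTID_OFFSET = 4, 4, 6   # assumed field widths in bytes

def fits_inline(payload, xmin_stealable=False):
    capacity = 8 + (4 if xmin_stealable else 0)
    return payload <= capacity     # spare bits record the chosen variant

locked_only = XID                             # locker xid alone: 4 bytes
deleted_and_locked = CID + XID                # no CTID needed: 8 bytes
updated_and_locked = CID + CTID_OFFSET + XID  # 14 bytes, needs a TED entry
```

So the common locked-only and deleted-and-locked cases stay on the tuple, and only the updated-and-locked case would pay for a TED lookup.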



Now, given where we are, how do we get there? Upgrade is a pain, because 
even if we no longer generate any new multixids, we'll have to be able 
to decode them after pg_upgrade. Perhaps condense pg_multixact into a 
simpler pg_clog-style bitmap at pg_upgrade, to make it small and simple 
to read, but it would nevertheless be a fair amount of code just to deal 
with pg_upgraded databases.


I think this is worth doing, even after we've fixed all the acute 
multixid bugs, because this would be more robust in the long run. It 
would also remove the need to do anti-wraparound multixid vacuums, and 
the newly-added tuning knobs related to that.


- Heikki




Re: [HACKERS] Multixid hindsight design

2015-05-11 Thread Tom Lane
Heikki Linnakangas hlinn...@iki.fi writes:
 So the lesson here is that having a permanent pg_multixact is not nice, 
 and we should get rid of it. Here's how to do that:

That would be cool, but ...

 Looking at the tuple header, the CID and CTID fields are only needed 
 when either xmin or xmax is running. Almost: in a HOT-updated tuple, 
 CTID is required even after xmax has committed, but since it's a HOT 
 update, the new tuple is always on the same page so you only need the 
 offsetnumber part.

I think this is totally wrong.  HOT update or not, you need the forward
link represented by ctid not just until xmin/xmax commit, but until
they're in the past according to all active snapshots.  That's so that
other transactions can chain forward to the current version of a tuple
which they found according to their snapshots.

It might be you can salvage the idea anyway, since it's still true that
the forward links wouldn't be needed after a crash and restart.  But the
argument as stated is wrong.

(There's also the question of forensic requirements, although I'm aware
that it's barely worth bringing that up since nobody else here seems to
put much weight on that.)

regards, tom lane

