Re: [HACKERS] Logical replication and multimaster

2015-12-15 Thread Jim Nasby

On 12/13/15 7:37 AM, David Fetter wrote:

As I understand it, pushing these into a library has been proposed but
not rejected.  That it hasn't happened yet is mostly about the lack of
tuits (the round ones) to rewrite the functionality as libraries and
refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
less about writing the code and more about the enormous amount of
testing any such a refactor would entail.


My understanding as well. IIRC Jon Erdman brought this question up a 
couple years ago and the response was "It'd probably be accepted, it's 
just that no one has done the work."



I believe that refactoring much of pg_dump's functionality for the
current version of the server into SQL-accessible functions and making
pg_dump use only those functions is achievable with available
resources.

Such a refactor need not be all-or-nothing.  For example, the
dependency resolution stuff is a first step that appears to be worth
doing by itself even if the effort then pauses, possibly for some
time.


If someone wanted to spend time on this, I suspect it'd be worth looking 
at how bad some of the backward compatibility issues would be if done in 
the server. Maybe they wouldn't be that bad. I suspect the audience for 
this code would be much larger if it was in the server as opposed to a C 
library.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Logical replication and multimaster

2015-12-15 Thread Jon Erdman
On Tue, 15 Dec 2015 21:48:52 -0600
Jim Nasby  wrote:

> On 12/13/15 7:37 AM, David Fetter wrote:
> > As I understand it, pushing these into a library has been proposed but
> > not rejected.  That it hasn't happened yet is mostly about the lack of
> > tuits (the round ones) to rewrite the functionality as libraries and
> > refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
> > less about writing the code and more about the enormous amount of
> > testing any such a refactor would entail.
> 
> My understanding as well. IIRC Jon Erdman brought this question up a 
> couple years ago and the response was "It'd probably be accepted, it's 
> just that no one has done the work."

To be clear, if I understand what you are referring to, my name pops up in this 
discussion because in Amsterdam around the time that FDWs were originally 
launched, I had a "brilliant" idea that a great candidate for an FDW would be 
one that could read from a custom dump (-Fc) file and expose it as though it 
contained regular tables, so you could restore one erroneously deleted row from 
a 2TB dump without loading the whole thing, or the whole table in question.

On the face of it this seemed relatively simple since a custom dump has a TOC 
and all the requisite goodies to make this doable, plus the code exists to 
interpret that (for restoring just one table out of a dump file) and all that 
was needed was the "glue" to hook it into FDW.

Initially the reaction (from Magnus if I'm not mistaken) was "that's stupid, 
who would want that", but later Dave Page was wholly on board with it.

At the next pgcon I spoke up on the same subject at the end of a talk about 
FDWs where Tom was in attendance, and all agreed my idea had merit...however, 
unexpectedly they (including Tom) agreed that trying to turn that part of our 
command line functionality into a library (the proper solution) was more effort 
than it was worth, and that if I wanted to try it I should just cut and paste 
the relevant code out of pg_dump and into my FDW, rather than trying to 
refactor and share said code in a .so. [I was *VERY* surprised by this!]

No one said it couldn't be done, but even the "wise men on the mount" conceded 
that it was such a huge undertaking that it was not worth the effort, and 
duplicating and subsequently maintaining said duplicated code was the better 
part of valor.
 
> > I believe that refactoring much of pg_dump's functionality for the
> > current version of the server into SQL-accessible functions and making
> > pg_dump use only those functions is achievable with available
> > resources.
> >
> > Such a refactor need not be all-or-nothing.  For example, the
> > dependency resolution stuff is a first step that appears to be worth
> > doing by itself even if the effort then pauses, possibly for some
> > time.
> 
> If someone wanted to spend time on this, I suspect it'd be worth looking 
> at how bad some of the backward compatibility issues would be if done in 
> the server. Maybe they wouldn't be that bad. I suspect the audience for 
> this code would be much larger if it was in the server as opposed to a C 
> library.
> -- 
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Jon Erdman (aka StuckMojo)
PostgreSQL Zealot




Re: [HACKERS] Logical replication and multimaster

2015-12-14 Thread Konstantin Knizhnik
I have updated the DTM page at the PostgreSQL wiki, adding information about
multimaster.
We have also created a repository at GitHub with our
version of PostgreSQL and the DTM extensions:
multimaster, pg_dtm, pg_tsdtm, bdr (sorry for the plagiarism; it is just a
toy, lightweight version of multimaster with asynchronous replication,
used to compare performance).




On 13.12.2015 15:46, Konstantin Knizhnik wrote:

On 12/13/2015 12:19 PM, Simon Riggs wrote:
On 6 December 2015 at 17:39, Konstantin Knizhnik wrote:


I have integrated pglogical_output in multimaster, using
bdr_apply from BDR as template for implementation of receiver part.


I didn't see the patch for this anywhere. Where is the code?


I am sorry, the code is now in our internal gitlab repository.
We have published pg_dtm and pg_tsdtm as separate repositories at 
github.com/postgrespro.
They include the source of the plugin itself and a patch to the PostgreSQL core.
But we find this very inconvenient, because we also have to extend the 
DTM API, adding new features such as deadlock detection...
So we are going to publish at github.com/postgrespro our branch of 
PostgreSQL where pg_dtm, pg_tsdtm and multimaster will be available as 
extensions in the contrib directory.  It will be available on Monday.




--
Simon Riggs http://www.2ndQuadrant.com/ 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Simon Riggs
On 13 December 2015 at 11:02, Andres Freund  wrote:

> On December 13, 2015 10:19:07 AM CET, Simon Riggs 
> wrote:
> >I didn't see the patch for this anywhere. Where is the code?
>
> Where is the code for all of pglogical?
>

Thanks for asking; perhaps our plans weren't public enough. pglogical has
already been announced as open source, under the Postgres licence, and as a
submission to core PostgreSQL, just as BDR was. pglogical is
in development/test right now and will be released when it's ready, which
hopefully will be "soon", aiming for 9.6.

Thanks also for the opportunity to let me ask what your plans are regarding
contributing to core? I couldn't make it to SF recently because of a
funding meeting, but I heard something like your company will release
something as open source sometime in 2016. Could you clarify what that will
be, when it will be, what licence it is under and if it is a contribution
to core also? Is that something you're working on also?

I don't know the status of Konstantin's work, so same question for him also.

It will be useful to help work out what parts need work on once pglogical
is done.

For me, submission to core means both that it is postgres licenced and that
the copyright is novated to PGDG, allowing it to be included within PG core.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Andres Freund
On 2015-12-13 11:39:32 +, Simon Riggs wrote:
> On 13 December 2015 at 11:02, Andres Freund  wrote:
> > On December 13, 2015 10:19:07 AM CET, Simon Riggs 
> > wrote:
> > >I didn't see the patch for this anywhere. Where is the code?
> >
> > Where is the code for all of pg logical?
> >
> 
> Thanks for asking, perhaps our plans weren't public enough. pglogical has
> already been announced as open source, under the postgres licence and that
> it will be a submission to core PostgreSQL, just as BDR was. pglogical is
> in development/test right now and will be released when its ready, which
> hopefully will be "soon", aiming for 9.6.

Well, at the moment not making it public is obviously blocking other
people, and not doing open design discussions publicly seems to make
it rather unlikely that it'll get accepted close to as-is anyway. It's
constantly referred to in discussions, and it guided the design of the
submitted output plugin.


> Thanks also for the opportunity to let me ask what your plans are regarding
> contributing to core?

Uh, I am? Stuff like working on upsert, grouping sets et al, was all on
Citus' time. I've been busy with something else for the last 2-3 months.


> I couldn't make it to SF recently because of a funding meeting, but I
> heard something like your company will release something as open
> source sometime in 2016. Could you clarify what that will be, when it
> will be, what licence it is under and if it is a contribution to core
> also? Is that something you're working on also?

I don't see how that belongs to this thread, it's unrelated to
replication.

Either way, the license is yet to be determined, and it'll be Q1
2016. Yes, I've worked on open sourcing it.


Greetings,

Andres Freund




Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Andres Freund
On December 13, 2015 10:19:07 AM CET, Simon Riggs  wrote:
>I didn't see the patch for this anywhere. Where is the code?


Where is the code for all of pglogical?

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Simon Riggs
On 6 December 2015 at 17:39, Konstantin Knizhnik 
wrote:


> I have integrated pglogical_output in multimaster, using bdr_apply from
> BDR as template for implementation of receiver part.
>

I didn't see the patch for this anywhere. Where is the code?

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Konstantin Knizhnik

On 12/13/2015 12:19 PM, Simon Riggs wrote:

On 6 December 2015 at 17:39, Konstantin Knizhnik wrote:

I have integrated pglogical_output in multimaster, using bdr_apply from BDR 
as template for implementation of receiver part.


I didn't see the patch for this anywhere. Where is the code?


I am sorry, the code is now in our internal gitlab repository.
We have published pg_dtm and pg_tsdtm as separate repositories at 
github.com/postgrespro.
They include the source of the plugin itself and a patch to the PostgreSQL core.
But we find this very inconvenient, because we also have to extend the DTM API, 
adding new features such as deadlock detection...
So we are going to publish at github.com/postgrespro our branch of PostgreSQL 
where pg_dtm, pg_tsdtm and multimaster will be available as extensions in 
the contrib directory.  It will be available on Monday.



--
Simon Riggs http://www.2ndQuadrant.com/ 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread David Fetter
On Sat, Dec 12, 2015 at 06:48:58PM +0800, Craig Ringer wrote:
> Being able to access pg_dump and pg_restore's dependency resolution logic,
> object dumping routines, etc from regular SQL and from the SPI would be
> wonderful.

As I understand it, pushing these into a library has been proposed but
not rejected.  That it hasn't happened yet is mostly about the lack of
tuits (the round ones) to rewrite the functionality as libraries and
refactor pg_dump/pg_restore to use only calls to same.  As usual, it's
less about writing the code and more about the enormous amount of
testing any such refactor would entail.

> I believe the main complaints about doing that when it was discussed in the
> past were issues with downgrading. A pg_get_tabledef(...) in 9.6 might emit
> keywords etc that a 9.2 server wouldn't understand, and the way we
> currently solve this is to require that you run 9.2's pg_dump against the
> 9.6 server to get a 9.2-compatible dump. So if we had pg_get_blahdef
> functions we'd still need external versions, so why bother having them?

You have made a persuasive case that major version downgrading is not
a problem we need to solve on the first go.

> The alternative is to have all the get_blahdef functions accept a param for
> server version compatibility, which would work but burden future servers
> with knowledge about older versions' features and corresponding code cruft
> for some extended period of time.
> 
> So it's gone nowhere to date.

I believe that refactoring much of pg_dump's functionality for the
current version of the server into SQL-accessible functions and making
pg_dump use only those functions is achievable with available
resources.

Such a refactor need not be all-or-nothing.  For example, the
dependency resolution stuff is a first step that appears to be worth
doing by itself even if the effort then pauses, possibly for some
time.
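
As a purely illustrative sketch of what "SQL-accessible" could mean here, a
backend function along the following lines could return an object's
definition; the function name, signature and stub body are hypothetical, and
nothing like this exists in core today:

#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_get_tabledef);

/* Hypothetical: return a CREATE TABLE statement for the given relation. */
Datum
pg_get_tabledef(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    StringInfoData buf;

    initStringInfo(&buf);

    /*
     * The real work -- reconstructing columns, constraints and storage
     * options from the catalogs, i.e. logic that currently lives in
     * pg_dump -- would go here.
     */
    appendStringInfo(&buf, "CREATE TABLE ... ; -- definition of relation %u",
                     relid);

    PG_RETURN_TEXT_P(cstring_to_text(buf.data));
}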

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Logical replication and multimaster

2015-12-13 Thread Simon Riggs
On 13 December 2015 at 11:53, Andres Freund  wrote:


> > Thanks for asking, perhaps our plans weren't public enough. pglogical has
> > already been announced as open source, under the postgres licence and
> that
> > it will be a submission to core PostgreSQL, just as BDR was. pglogical is
> > in development/test right now and will be released when its ready, which
> > hopefully will be "soon", aiming for 9.6.
>
> Well, at the moment not making it public is obviously blocking other
> people, and


What other people are being blocked? What contribution to PostgreSQL core are
they making that is being delayed? What is the nature of this delay?


> not doing open design discussions publically seems to make
> it rather unlikely that it'll get accepted close to as-is anyway. It's
> constantly referred to in discussions, and it guided the design of the
> submitted output plugin.


It's a shame you think that, but posting incoherent proposals just wastes
everybody's time.

The focus has been on making the internals more generic, unpicking many of
the parts of BDR that were too specialized. The output plugin has not been
guided by the needs of pglogical at all; it's been designed to be way more
open than BDR was.

The UI is a fairly thin layer on top of that and can be recoded without too
much work, but I don't think it's a surprising design. It's been quite
difficult to cater for the many complex and sometimes conflicting
requirements, and that has only recently come together into a coherent
form by my hand. Whether the UI makes sense remains to be seen, but I think
building it is an essential part of the evaluation of whether it is
actually a good UI.

Submission to core implies that changes are possible and discussion is
welcome. I expect that to happen. If there were any truly contentious parts
they would have been discussed ahead of time. I see no reason to believe
that pglogical would not or could not be accepted into 9.6.


> > Thanks also for the opportunity to let me ask what your plans are
> regarding
> > contributing to core?
>
> Uh, I am? Stuff like working on upsert, grouping sets et al, was all on
> Citus' time. I've been busy with something else for the last 2-3 months.


Good, thanks; you misread that and I wasn't questioning it. pglogical
developers have a day job too.


> > I couldn't make it to SF recently because of a funding meeting, but I
> > heard something like your company will release something as open
> > source sometime in 2016. Could you clarify what that will be, when it
> > will be, what licence it is under and if it is a contribution to core
> > also? Is that something you're working on also?
>
> I don't see how that belongs to this thread, it's unrelated to
> replication.
>

I assumed your interest in pglogical meant there was some connection.


> Either way, the license is yet to be determined, and it'll be Q1
> 2016. Yes, I've worked on open sourcing it.
>

If it's under the Postgres licence and submitted to core, as is BDR, you may
find many people interested in working on it also.

Initial development of major features is IMHO best done by small groups of
dedicated developers. That has got nothing at all to do with what happens
to the code in the longer term.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-12 Thread Craig Ringer
On 12 December 2015 at 02:24, Robert Haas  wrote:

> On Fri, Dec 11, 2015 at 5:16 AM, Andres Freund  wrote:
> > On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:
> >> On 10 December 2015 at 03:19, Robert Haas 
> wrote:
> >> > On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer 
> >> > wrote:
> >> > > * A way to securely make a libpq connection from a bgworker without
> >> > messing
> >> > > with passwords etc. Generate one-time cookies, sometihng like that.
> >> >
> >> > Why would you have the bgworker connect to the database via TCP
> >> > instead of just doing whatever it wants to do directly?
> >
> >> pg_dump and pg_restore, mainly, for copying the initial database state.
> >
> > Well, you don't want to necessarily directly connect from the bgworker,
> > but from processes started from a bgworker. I guess that's where a good
> > bit of the Robert's confusion originated.
>
> That's part of it, yeah.  I'm a little scared of this design.  I mean,
> I understand now why Craig wants to do this (thanks for explaining,
> Craig!), but it seems like it's going to have a lot of the same
> reliability problems that pg_upgrade does.


Yes, and more.

Especially when dealing with multiple upstream servers, etc.

It's not very nice. I would very much prefer to have a better way to
achieve the initial data sync, but at present I don't think there is any
better approach that's even remotely practical.

> I'm not saying there's a better way to get the functionality


Yup. That's the problem.


> but it's pretty obvious that depending on tools other than the server
> itself, and in particular
> pg_dump, vastly increases the failure surface area.


 It's not too bad to find pg_dump, though we landed up not being able to
re-use find_other_exec for various reasons I'll have to try to dig out of
the cruftier corners of my memory.  It has a fairly sane interface too.

Things get hairy when you want to do things like "give me all the
upstream's non-table objects, then give me [this set of table
definitions]"... then you go and sync the data from an exported snapshot
using COPY, then finish up by restoring the constraints for the set of
tables you dumped.

Being able to access pg_dump and pg_restore's dependency resolution logic,
object dumping routines, etc from regular SQL and from the SPI would be
wonderful.

I believe the main complaints about doing that when it was discussed in the
past were issues with downgrading. A pg_get_tabledef(...) in 9.6 might emit
keywords etc that a 9.2 server wouldn't understand, and the way we
currently solve this is to require that you run 9.2's pg_dump against the
9.6 server to get a 9.2-compatible dump. So if we had pg_get_blahdef
functions we'd still need external versions, so why bother having them?

The alternative is to have all the get_blahdef functions accept a param for
server version compatibility, which would work but burden future servers
with knowledge about older versions' features and corresponding code cruft
for some extended period of time.

So it's gone nowhere to date.

For that matter it's not clear that pg_get_blahdef functions would be the
right solution, but I can't see directly poking around in the catalogs and
basically re-implementing pg_dump being OK either. So what else could we do?


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-11 Thread Craig Ringer
On 10 December 2015 at 03:19, Robert Haas  wrote:

> On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer 
> wrote:
>


> > * A way to securely make a libpq connection from a bgworker without
> messing
> > with passwords etc. Generate one-time cookies, sometihng like that.
>
> Why would you have the bgworker connect to the database via TCP
> instead of just doing whatever it wants to do directly?



pg_dump and pg_restore, mainly, for copying the initial database state.

PostgreSQL doesn't have SQL-level function equivalents, nor
pg_get_tabledef() etc, and there's been strong opposition to adding
anything of the sort when it's been raised before. We could read a dump in
via pg_restore's text conversion and run the appropriate queries over the
SPI, doing the query splitting, COPY parsing and loading, etc ourselves in
a bgworker. It'd be ugly and duplicate a lot, but it'd work. However, it
wouldn't be possible to do restores in parallel that way, and that's
necessary to get good restore performance on big DBs. For that we'd also
basically rewrite pg_restore's parallel functionality using a bgworker
pool.

The alternative is a massive rewrite of pg_dump and pg_restore to allow
them to be used as libraries, and let them use either libpq or the SPI for
queries, presumably via some level of abstraction layer. As well as further
abstraction for pipelining parallel work. Not very practical, and IIRC
whenever library-ifing pg_dump and pg_restore has been discussed before
it's been pretty firmly rejected.

Also, parallelism at apply time. There are two ways to do apply work in
parallel - a pool of bgworkers that each use the SPI, or using regular
backends managing async libpq connections. At this point I think
Konstantin's approach, with a bgworker pool that processes a work queue, is
probably better for this, and want to explore making that a re-usable
extension for 9.5 and possibly a core part of 9.6 or 9.7.

So it's mainly for pg_restore.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-11 Thread Andres Freund
On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:
> On 10 December 2015 at 03:19, Robert Haas  wrote:
> > On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer 
> > wrote:
> > > * A way to securely make a libpq connection from a bgworker without
> > messing
> > > with passwords etc. Generate one-time cookies, sometihng like that.
> >
> > Why would you have the bgworker connect to the database via TCP
> > instead of just doing whatever it wants to do directly?

> pg_dump and pg_restore, mainly, for copying the initial database state.

Well, you don't want to necessarily directly connect from the bgworker,
but from processes started from a bgworker. I guess that's where a good
bit of Robert's confusion originated.




Re: [HACKERS] Logical replication and multimaster

2015-12-11 Thread Petr Jelinek

On 2015-12-11 19:24, Robert Haas wrote:

On Fri, Dec 11, 2015 at 5:16 AM, Andres Freund  wrote:

On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:

On 10 December 2015 at 03:19, Robert Haas  wrote:

On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer 
wrote:

* A way to securely make a libpq connection from a bgworker without

messing

with passwords etc. Generate one-time cookies, sometihng like that.


Why would you have the bgworker connect to the database via TCP
instead of just doing whatever it wants to do directly?



pg_dump and pg_restore, mainly, for copying the initial database state.


Well, you don't want to necessarily directly connect from the bgworker,
but from processes started from a bgworker. I guess that's where a good
bit of the Robert's confusion originated.


That's part of it, yeah.  I'm a little scared of this design.  I mean,
I understand now why Craig wants to do this (thanks for explaining,
Craig!), but it seems like it's going to have a lot of the same
reliability problems that pg_upgrade does.  I'm not saying there's a
better way to get the functionality, but it's pretty obvious that
depending on tools other than the server itself, and in particular
pg_dump, vastly increases the failure surface area.



Well, it's better than trying to write a completely new catalog dump tool 
for this. As Craig said, it would be best if pg_dump's functionality were 
moved to functions in the backend, but that's probably not gonna happen 
tomorrow.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical replication and multimaster

2015-12-11 Thread Robert Haas
On Fri, Dec 11, 2015 at 5:16 AM, Andres Freund  wrote:
> On 2015-12-11 18:12:55 +0800, Craig Ringer wrote:
>> On 10 December 2015 at 03:19, Robert Haas  wrote:
>> > On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer 
>> > wrote:
>> > > * A way to securely make a libpq connection from a bgworker without
>> > messing
>> > > with passwords etc. Generate one-time cookies, sometihng like that.
>> >
>> > Why would you have the bgworker connect to the database via TCP
>> > instead of just doing whatever it wants to do directly?
>
>> pg_dump and pg_restore, mainly, for copying the initial database state.
>
> Well, you don't want to necessarily directly connect from the bgworker,
> but from processes started from a bgworker. I guess that's where a good
> bit of the Robert's confusion originated.

That's part of it, yeah.  I'm a little scared of this design.  I mean,
I understand now why Craig wants to do this (thanks for explaining,
Craig!), but it seems like it's going to have a lot of the same
reliability problems that pg_upgrade does.  I'm not saying there's a
better way to get the functionality, but it's pretty obvious that
depending on tools other than the server itself, and in particular
pg_dump, vastly increases the failure surface area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Logical replication and multimaster

2015-12-09 Thread Robert Haas
On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer  wrote:
> * An API to enumerate currently registered bgworkers
> * A way to securely make a libpq connection from a bgworker without messing
> with passwords etc. Generate one-time cookies, sometihng like that.
> * (unimportant but nice API fix): BGW_NO_RESTART_ON_CRASH

Why would you have the bgworker connect to the database via TCP
instead of just doing whatever it wants to do directly?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Logical replication and multimaster

2015-12-09 Thread Andres Freund
On 2015-12-03 09:54:23 +0300, konstantin knizhnik wrote:
> But right now performance of Multimaster is not limited by logical 
> replication protocol - if I remove DTM and use asynchronous replication 
> (lightweight version of BDR:)
> then I get 38k TPS instead of 12k.

My guess is that that's to a large degree because BDR 'batches' WAL
flushes / fsyncs over several connections. As the data is only applied
in one connection, whereas the primary can use multiple backends, it is
important not to flush constantly, as that's synchronous.

What I did for bdr was to register the 'desired' flush position whenever
replaying a commit (c.f. the dlist_push_tail() call in process_remote_commit())
and, whenever feedback is sent, figure out how far the WAL has actually been
flushed (c.f. bdr_get_flush_position()).
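
Roughly, the pattern is the one sketched below (a hedged illustration with
invented names, not the actual BDR code): at apply time, remember which local
WAL position has to be flushed before a given remote commit may be confirmed,
and only when sending feedback check how far WAL has really been flushed,
instead of fsyncing per applied transaction.

#include "postgres.h"
#include "access/xlog.h"
#include "lib/ilist.h"

typedef struct FlushPosition
{
    dlist_node  node;
    XLogRecPtr  local_end;      /* local WAL end of the applied commit */
    XLogRecPtr  remote_end;     /* corresponding remote commit LSN */
} FlushPosition;

static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);

/* Called while applying a remote commit: just record the association. */
static void
remember_flush_position(XLogRecPtr local_end, XLogRecPtr remote_end)
{
    FlushPosition *pos = palloc(sizeof(FlushPosition));

    pos->local_end = local_end;
    pos->remote_end = remote_end;
    dlist_push_tail(&lsn_mapping, &pos->node);
}

/* Called when sending feedback: confirm everything whose WAL is flushed. */
static XLogRecPtr
flushed_up_to(void)
{
    XLogRecPtr  flushed = GetFlushRecPtr();     /* 9.5-era signature */
    XLogRecPtr  result = InvalidXLogRecPtr;
    dlist_mutable_iter iter;

    dlist_foreach_modify(iter, &lsn_mapping)
    {
        FlushPosition *pos = dlist_container(FlushPosition, node, iter.cur);

        if (pos->local_end > flushed)
            break;              /* not flushed yet, stop here */
        result = pos->remote_end;
        dlist_delete(iter.cur);
        pfree(pos);
    }
    return result;
}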

Now that cannot trivially be done with 2PC, but it doesn't look all that
hard to change the 2PC API to allow at least some batching of the
fsyncs.

Greetings,

Andres Freund




Re: [HACKERS] Logical replication and multimaster

2015-12-06 Thread Konstantin Knizhnik

Hi,

I have integrated pglogical_output in multimaster, using bdr_apply from BDR as 
a template for the implementation of the receiver part.
Insert time is reduced almost 10 times compared with logical replication based 
on the decoder_raw/receiver_raw plugins, which perform logical replication 
using SQL statements. But unfortunately update time is almost unchanged.
This is the expected result, because I didn't see any functions related to SQL 
parsing/preparing in the profile.
Now the profile is similar in both cases:

  4.62%  postgres[.] HeapTupleSatisfiesMVCC
  2.99%  postgres[.] heapgetpage
  2.10%  postgres[.] hash_search_with_hash_value
  1.86%  postgres[.] ExecProject
  1.80%  postgres[.] heap_getnext
  1.79%  postgres[.] PgXidInMVCCSnapshot

By the way, you asked about comments concerning pglogical_output. I have one: most of 
the pglogical protocol functions have a "PGLogicalOutputData *data" parameter. There 
are a few exceptions:

write_startup_message_fn, pglogical_write_origin_fn, pglogical_write_rel_fn

PGLogicalOutputData is the only way to pass protocol-specific data, using the 
"PGLogicalProtoAPI *api" field.
This field is assigned by the pglogical_init_api() function, and I can extend the 
PGLogicalProtoAPI structure by adding protocol-specific fields.
For example, this is how it is done now for multimaster:

typedef struct PGLogicalProtoMM
{
    PGLogicalProtoAPI api;
    bool isLocal; /* mark transaction as local */
} PGLogicalProtoMM;

PGLogicalProtoAPI *
pglogical_init_api(PGLogicalProtoType typ)
{
    PGLogicalProtoMM* pmm = palloc0(sizeof(PGLogicalProtoMM));
    PGLogicalProtoAPI* res = &pmm->api;

    pmm->isLocal = false;
    res->write_rel = pglogical_write_rel;
    res->write_begin = pglogical_write_begin;
    res->write_commit = pglogical_write_commit;
    res->write_insert = pglogical_write_insert;
    res->write_update = pglogical_write_update;
    res->write_delete = pglogical_write_delete;
    res->write_startup_message = write_startup_message;
    return res;
}

But I have to add a "PGLogicalOutputData *data" parameter to the 
pglogical_write_rel_fn function.
Do you think it would be better to pass this parameter to all functions?

Maybe this is not the intended way of passing custom data to these functions...
Certainly it is possible to use static variables for this purpose,
but I think that passing user-specific data through PGLogicalOutputData is a 
safer and more flexible solution.



On 12/03/2015 04:53 PM, Craig Ringer wrote:

On 3 December 2015 at 19:06, konstantin knizhnik wrote:


On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:


On 3 December 2015 at 14:54, konstantin knizhnik wrote:



I'd really like to collaborate using pglogical_output if at all 
possible. Petr's working really hard to get the pglogical downstrem out too, 
with me helping where I can.

And where I can get  pglogical_output plugin? Sorry, but I can't 
quickly find reference with Google...


It's been submitted to this CF.

https://commitfest.postgresql.org/7/418/

https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output

Any tests and comments would be greatly appreciated.



Thank you.
I wonder if there is opposite part of the pipe for pglogical_output - 
analog of receiver_raw?


It's pglogical, and it's in progress, due to be released at the same time as 
9.5. We're holding it a little longer to nail down the user interface a bit 
better, etc, and because sometimes the real world gets in the way.

The catalogs and UI are very different to BDR, it's much more
extensible/modular, it supports much more flexible topologies, etc... but
lots of the core concepts are very similar. So if you go take a look at the
BDR code that'll give you a pretty solid idea of how a lot of it works,
though BDR has whole subsystems pglogical doesn't (global ddl lock, ddl
replication, etc).

--
 Craig Ringer http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical replication and multimaster

2015-12-06 Thread Craig Ringer
On 7 December 2015 at 01:39, Konstantin Knizhnik 
wrote:

>
> I have integrated pglogical_output in multimaster
>

Excellent.

I just pushed a change to pglogical_output that exposes the row contents
(and the rest of the reorder change buffer contents) to hooks that want it,
by the way.


> using bdr_apply from BDR as template for implementation of receiver part.
>

Yep, that'll tide you over. We're working hard on getting the downstream
part ready and you'll find it more flexible.


> The time of insert is reduced almost 10 times comparing with logical
> replication based on decoder_raw/receiver_raw plugins which performs
> logical replication using SQL statements. But unfortunately time of updates
> is almost not changed.
>

That's not too surprising, given that you'll have significant overheads for
checking if keys are present when doing updates.

This field is assigned by  pglogical_init_api() function. And I can extend
> this PGLogicalProtoAPI structure by adding some protocol specific fields.
>

Yep, that's the idea.


> typedef struct PGLogicalProtoMM
> {
> PGLogicalProtoAPI api;
> bool isLocal; /* mark transaction as local */
> } PGLogicalProtoMM;
>

I'm curious about what you're using the 'isLocal' field for.

For MM you should only need to examine the replication origin assigned to
the transaction to determine whether you're going to forward it or not.

Were you not able to achieve what you wanted with a hook? If not, then we
might need another hook. Could you explain what it's for in more detail?

What I suggest is: have your downstream client install a pglogical_output
hook for the transaction filter hook. There, examine the replication origin
passed to the hook. If you want to forward locally originated xacts only
(such as for mesh multimaster) you can just filter out everything where the
origin is not InvalidRepOriginId. There are example hooks in
contrib/pglogical_output_plhooks .
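
For illustration only, a filter along these lines is all that mesh forwarding
needs; the argument struct below is invented, and the real hook signature in
the submitted pglogical_output patch may differ (see
contrib/pglogical_output_plhooks for the actual examples):

#include "postgres.h"
#include "replication/origin.h"

/* Hypothetical stand-in for the output plugin's txn filter arguments. */
typedef struct TxnFilterArgs
{
    RepOriginId origin_id;      /* replication origin of the decoded xact */
} TxnFilterArgs;

/*
 * Return true to forward the transaction. For mesh multimaster we forward
 * only locally originated xacts; anything already carrying a replication
 * origin came from another node and must not be forwarded again.
 */
static bool
mm_txn_filter(TxnFilterArgs *args)
{
    return args->origin_id == InvalidRepOriginId;
}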

There'll be a simple MM example using filter hooks in the pglogical
downstream btw and we're working hard to get that out.


> But I have to add "PGLogicalOutputData *data"  parameter to
> pglogical_write_rel_fn function.
> Do you think that it will be better to pass this parameter to all
> functions?
>

Yes, I agree that it should be passed to the API for the output protocol.
It's pretty harmless. Please feel free to send a pull req.

Note that we haven't made that pluggable from the outside though; there's
no way to load a new protocol distributed separately from pglogical_output.
The idea is really to make sure that between the binary protocol and json
protocol we meet the reasonably expected set of use cases and don't need
pluggable protocols. Perhaps that's over-optimistic, but we've already got
an output plugin that has plug-in hooks, a plugin for a plugin. Do we need
another? Also, if we allow dynamic loading of new protocols then that means
we'll have a much harder time changing the protocol implementation API
later, so it's not something I'm keen to do. Also, to make it secure to
allow users to specify the protocol we'd have to make protocols implement
an extension with a C function in pg_proc to return its API struct, like we
do for hooks. So there'd be more hoop-jumping required to figure out how to
talk to the client.

If possible I'd like to find any oversights and omissions in the current
protocol and its APIs to meet future use cases without having to introduce
protocol plugins for an output plugin.

May be it is not intended way of passing custom data to this functions...
>

Yeah, we weren't really thinking of the protocol API as intended to be
pluggable and extensible. If you define new protocols you have to change
the rest of the output plugin code anyway.

Lets look at what protocol changes are needed to address your use case and
see whether it's necessary to take the step of making the protocol fully
pluggable or not.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-06 Thread Craig Ringer
> There are definitely two clear places where additional help would be
> useful and welcome right now.
>

Here's a shortlist of replication-related open items I put together
earlier. These are all independent things it'd be helpful to have when
working with logical replication in PostgreSQL. There are a lot of
medium-sized pieces of independent work still to be done to make some
things work well.

For example, you can't have multimaster logical replication with physical
failover nodes for each MM node unless we replicate slot
create/drop/advance over physical replication and copy them with
pg_basebackup.

Handling big transactions better:

* logical decoding interleaved transaction streaming as discussed earlier

Automatic DDL replication:

* DDL deparse extension (Álvaro started some work on this)
* Better way to link the pg_temp_nnn tables generated during table rewrite
to the original table in decoding
* logical decoding support for prepared xacts (before commit prepared).
Useful for DDL replication, other consensus operations.

Using physical replication/PITR with logical replication:

* replication slots replicated to physical standbys ("Failover slots")
* pg_basebackup should copy slots
* slots should follow timeline changes

Failover between logical replicas:

* logical decoding support for sequences
* logical decoding support for slot create/drop/advance
* Separate slot WAL retention from confirmed commit point so you can say
"I've replayed up to 1/AB but you should keep from 1/01". Needed in
async MM to cope with node loss properly. Will write it up separately.

Multimaster:

* Sequence Access Method

Other logical decoding enhancements:

* Support exporting a new snapshot from an existing logical slot. Useful
for COPYing new tables not previously replicated when added to a
replication set, for resync'ing tables, comparing tables, etc.
* WAL messages. Useful for voting, replay confirmation, etc. Rendered
largely unnecessary by xact interleaved streaming.

Misc:

* An API to enumerate currently registered bgworkers
* A way to securely make a libpq connection from a bgworker without messing
with passwords etc. Generate one-time cookies, sometihng like that.
* (unimportant but nice API fix): BGW_NO_RESTART_ON_CRASH



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-04 Thread Craig Ringer
On 3 December 2015 at 20:39, Simon Riggs  wrote:

> On 30 November 2015 at 17:20, Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>
>> But looks like there is not so much sense in having multiple network
>> connection between one pair of nodes.
>> It seems to be better to have one connection between nodes, but provide
>> parallel execution of received transactions at destination side. But it
>> seems to be also nontrivial. We have now in PostgreSQL some infrastructure
>> for background works, but there is still no abstraction of workers pool and
>> job queue which can provide simple way to organize parallel execution of
>> some jobs. I wonder if somebody is working now on it or we should try to
>> propose our solution?
>>
>
> There are definitely two clear places where additional help would be
> useful and welcome right now.
>
> 1. Allowing logical decoding to have a "speculative pre-commit data"
> option, to allow some data to be made available via the decoding api,
> allowing data to be transferred prior to commit.
>

Something relevant I ran into re this:

in reorderbuffer.c, on ReorderBufferCommit:

   * We currently can only decode a transaction's contents in when their commit
   * record is read because that's currently the only place where we know about
   * cache invalidations. Thus, once a toplevel commit is read, we iterate over
   * the top and subtransactions (using a k-way merge) and replay the changes in
   * lsn order.

I haven't dug into the implications particularly as I'm chasing something
else, but want to note it on the thread. Here be dragons when it comes to
transaction streaming.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread Simon Riggs
On 30 November 2015 at 17:20, Konstantin Knizhnik  wrote:


> But looks like there is not so much sense in having multiple network
> connection between one pair of nodes.
> It seems to be better to have one connection between nodes, but provide
> parallel execution of received transactions at destination side. But it
> seems to be also nontrivial. We have now in PostgreSQL some infrastructure
> for background works, but there is still no abstraction of workers pool and
> job queue which can provide simple way to organize parallel execution of
> some jobs. I wonder if somebody is working now on it or we should try to
> propose our solution?
>

There are definitely two clear places where additional help would be useful
and welcome right now.

1. Allowing logical decoding to have a "speculative pre-commit data"
option, to allow some data to be made available via the decoding api,
allowing data to be transferred prior to commit. This would allow us to
reduce the delay that occurs at commit, especially for larger transactions
or very low latency requirements for smaller transactions. Some heuristic
or user interface would be required to decide whether to and which
transactions might make their data available prior to commit. And we would
need to send abort messages should the transactions not commit as expected.
That would be a patch on logical decoding and is an essentially separate
feature to anything currently being developed.

2. Some mechanism/theory to decide when/if to allow parallel apply. That
could be used for both physical and logical replication. Since the apply
side of logical replication is still being worked on there is a code
dependency there, so a working solution isn't what is needed yet. But the
general principles and any changes to the data content (wal_level) or
protocol (pglogical_output) would be useful.

We already have working multi-master that has been contributed to PGDG, so
contributing that won't gain us anything. There is a lot of code and
pglogical is the most useful piece of code to be carved off and reworked
for submission. The bottleneck is review and commit, not initial
development - which applies both to this area and most others in PostgreSQL.

Having a single network connection between nodes would increase efficiency
but also increase replication latency, so it's not useful in all cases.

I think having some kind of message queue between nodes would also help,
since there are many cases for which we want to transfer data, not just a
replication data flow. For example, consensus on DDL, or MPP query traffic.
But that is open to wider debate.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread konstantin knizhnik

On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:

> On 3 December 2015 at 14:54, konstantin knizhnik  
> wrote:
> 
>> I'd really like to collaborate using pglogical_output if at all possible. 
>> Petr's working really hard to get the pglogical downstrem out too, with me 
>> helping where I can.
> 
> And where I can get  pglogical_output plugin? Sorry, but I can't quickly find 
> reference with Google...
> 
> It's been submitted to this CF.
> 
> https://commitfest.postgresql.org/7/418/
> 
> https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
> 
> Any tests and comments would be greatly appreciated.
> 

Thank you.
I wonder if there is an opposite part of the pipe for pglogical_output - an 
analog of receiver_raw?



> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread Simon Riggs
On 3 December 2015 at 12:06, konstantin knizhnik 
wrote:

>
> On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:
>
> On 3 December 2015 at 14:54, konstantin knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>>
>> I'd really like to collaborate using pglogical_output if at all possible.
>> Petr's working really hard to get the pglogical downstrem out too, with me
>> helping where I can.
>>
>> And where I can get  pglogical_output plugin? Sorry, but I can't quickly
>> find reference with Google...
>>
>
> It's been submitted to this CF.
>
> https://commitfest.postgresql.org/7/418/
>
> https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
>
> Any tests and comments would be greatly appreciated.
>
>
> Thank you.
> I wonder if there is opposite part of the pipe for pglogical_output -
> analog of receiver_raw?
>

Yes, there is. pglogical is currently in test and will be available
sometime soon.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread Shulgin, Oleksandr
On Thu, Dec 3, 2015 at 8:34 AM, Craig Ringer  wrote:

> On 3 December 2015 at 14:54, konstantin knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>>
>> Are there some principle problems with it? In BDR it was handled in
>> alternative way, using executor callback. It will be much easier if DDL can
>> be replicated in the same way as normal SQL statements.
>>
>
> It can't. I wish it could.
>

That reminds me of that DDL deparsing patch I was trying to revive a while
ago.  Strangely, I cannot find it in any of the commit fests.  Will add it.

--
Alex


Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread Craig Ringer
On 3 December 2015 at 19:06, konstantin knizhnik 
wrote:

>
> On Dec 3, 2015, at 10:34 AM, Craig Ringer wrote:
>
> On 3 December 2015 at 14:54, konstantin knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>>
>> I'd really like to collaborate using pglogical_output if at all possible.
>> Petr's working really hard to get the pglogical downstrem out too, with me
>> helping where I can.
>>
>> And where I can get  pglogical_output plugin? Sorry, but I can't quickly
>> find reference with Google...
>>
>
> It's been submitted to this CF.
>
> https://commitfest.postgresql.org/7/418/
>
> https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output
>
> Any tests and comments would be greatly appreciated.
>
>
> Thank you.
> I wonder if there is opposite part of the pipe for pglogical_output -
> analog of receiver_raw?
>

It's pglogical, and it's in progress, due to be released at the same time
as 9.5. We're holding it a little longer to nail down the user interface a
bit better, etc, and because sometimes the real world gets in the way.

The catalogs  and UI are very different to BDR, it's much more
extensible/modular, it supports much more flexible topologies, etc... but
lots of the core concepts are very similar. So if you go take a look at the
BDR code that'll give you a pretty solid idea of how a lot of it works,
though BDR has whole subsystems pglogical doesn't (global ddl lock, ddl
replication, etc).

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread Craig Ringer
On 3 December 2015 at 20:39, Simon Riggs  wrote:

> On 30 November 2015 at 17:20, Konstantin Knizhnik <
> k.knizh...@postgrespro.ru> wrote:
>
>
>> But looks like there is not so much sense in having multiple network
>> connection between one pair of nodes.
>> It seems to be better to have one connection between nodes, but provide
>> parallel execution of received transactions at destination side. But it
>> seems to be also nontrivial. We have now in PostgreSQL some infrastructure
>> for background works, but there is still no abstraction of workers pool and
>> job queue which can provide simple way to organize parallel execution of
>> some jobs. I wonder if somebody is working now on it or we should try to
>> propose our solution?
>>
>
> There are definitely two clear places where additional help would be
> useful and welcome right now.
>

Three IMO, in that a re-usable, generic bgworker pool driven by shmem
messaging would be quite handy. We'll want something like that when we have
transaction interleaving.

I think Konstantin's design is a bit restrictive at the moment; at the
least it needs to address sticky dispatch, and it almost certainly needs to
be using dynamic bgworkers (and maybe dynamic shmem too) to be flexible.
Some thought will be needed to make sure it doesn't rely on !EXEC_BACKEND
stuff like passing pointers to fork()ed data from postmaster memory too.
But the general idea sounds really useful, and we'll either need that or to
use async libpq for concurrent apply.


> 1. Allowing logical decoding to have a "speculative pre-commit data"
> option, to allow some data to be made available via the decoding api,
> allowing data to be transferred prior to commit.
>

Petr, Andres and I tended to refer to that as interleaved transaction
streaming. The idea being to send changes from multiple xacts mixed
together in the stream, identified by an xid sent with each message, as we
decode them from WAL. Currently we add them to a local reorder buffer and
send them only in commit order after commit.

This moves responsibility for xact ordering (and buffering, if necessary)
to the downstream. It introduces the possibility that concurrently replayed
xacts could deadlock with each other and a few exciting things like that,
too, but with the payoff that we can continue to apply small transactions
in a timely manner even as we're streaming a big transaction like a COPY.

We could possibly enable interleaving right from the start of the xact, or
only once it crosses a certain size threshold. For your purposes Konstantin
you'd want to do it right from the start since latency is crucial for you.
For pglogical we'd probably want to buffer them a bit and only start
streaming if they got big.

This would allow us to reduce the delay that occurs at commit, especially
> for larger transactions or very low latency requirements for smaller
> transactions. Some heuristic or user interface would be required to decide
> whether to and which transactions might make their data available prior to
> commit.
>

I imagine we'd have a knob, either global or per-slot, that sets a
threshold based on size in bytes of the buffered xact. With 0 allowed as
"start immediately".


> And we would need to send abort messages should the transactions not
> commit as expected. That would be a patch on logical decoding and is an
> essentially separate feature to anything currently being developed.
>

I agree that this is strongly desirable. It'd benefit anyone using logical
decoding and would have wide applications.


> 2. Some mechanism/theory to decide when/if to allow parallel apply.
>

I'm not sure it's as much about allowing it as how to do it.


> We already have working multi-master that has been contributed to PGDG, so
> contributing that won't gain us anything.
>

Namely BDR.


> There is a lot of code and pglogical is the most useful piece of code to
> be carved off and reworked for submission.
>

Starting with the already-published output plugin, with the downstream to
come around the release of 9.5.


> Having a single network connection between nodes would increase efficiency
> but also increase replication latency, so its not useful in all cases.
>

If we interleave messages I'm not sure it's too big a problem. Latency
would only become an issue there if a big single row (big Datum contents)
causes lots of small work to get stuck behind it.

IMO this is a separate issue to be dealt with later.

I think having some kind of message queue between nodes would also help,
> since there are many cases for which we want to transfer data, not just a
> replication data flow. For example, consensus on DDL, or MPP query traffic.
> But that is open to wider debate.
>

Logical decoding doesn't really define any network protocol at all. It's
very flexible, and we can throw almost whatever we want down it. The
pglogical_output protocol is extensible enough that we can just add
additional messages when we need to, making 

Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread Craig Ringer
On 3 December 2015 at 15:27, konstantin knizhnik 
wrote:

>
> On Dec 3, 2015, at 4:18 AM, Craig Ringer wrote:
>
> > Excellent.
> >
> > It should be possible to make that a separate extension. You can use C
> functions from other extensions by exposing a single pg_proc function with
> 'internal' return type that populates a struct of function pointers for the
> API. A single DirectFunctionCall lets you get the API struct. That's how
> pglogical_output handles hooks. The main downside is that you can't do that
> without a connection to a database with the extension installed so the
> pg_proc entry is exposed.
>
>
> Actually, working under cluster and columnar storage extension I got
> several questions about PostgreSQL infrastructure.
> I always found some workarounds, but may it is better to ask community
> about it:)
>
> 1. Why there is no "conditional event" synchronization primitive in
> PostgreSQL. There is latch, but it is implemented using sockets and I
> afraid that it is not very fast.
> It will be nice to have some fast primitive like pthread condition
> variables.
>

The need for IPC makes things a bit more complex. Most places can get away
with using a latch, testing one or more conditions, and resuming waiting.

While what you describe sounds possibly nice is there any evidence that
it's a bottleneck or performance issue? Or is this premature optimisation
at work?


> 2. PostgreSQL semaphores seems to be not intended for external use outside
> PostgreSQL core  (for example in extensions).
> There is no way to request additional amount of semaphores. Right now
> semaphores are allocated based on maximal number of backends and spinlocks.
>

Same with spinlocks AFAIK.

You can add your own LWLocks though.


> 3. What is the right way of creation of background worker requiring access
> to shared memory, i.e. having control structure in main memory?
>

This is documented and well established.


> As far as I understand, background workers have to be registered either in
> _PG_init or outside the postmaster environment.
> If an extension requires access to shared memory, then it should be
> registered in the shared_preload_libraries list and should be initialized using
> a shmem_startup hook.
>

Correct.

You can use dynamic shmem instead, but there are some issues there IIRC.
Petr may have more to say there.

Take a look at the BDR code for some examples, and there are some in
contrib too I think.

> My_shmem_startup is needed because in _PG_init it is not possible to
> allocate shared memory.
>

Correct, since it's in early postmaster start.


> So if I need to allocate some control structure for background workers  in
> shared memory, then I should do it in My_shmem_startup.
>

Yes.


> But I can not register background workers in My_shmem_startup!


Correct. Register static bgworkers in _PG_init. Register dynamic bgworkers
later, in a normal backend function or a bgworker main loop.


> So I have to register background workers in _PG_init while the control
> structure for them is not yet ready.
>

Correct.

They aren't *started* until after shmem init, though.


> When I implemented the pool of background workers, I solved this problem
> by providing a function which returns the address of the control structure
> later - when it is actually allocated.
>

Beware of EXEC_BACKEND. You can't assume you have shared postmaster memory
from fork().

I suggest that you allocate a static shmem array. Pass indexes into it as
the arguments to the bgworkers.  Have them look up their index in the array
to get their struct pointer.
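A rough sketch of that arrangement (names invented, not taken from BDR):

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"

/* One control struct per pool worker, kept in a fixed-size shmem array. */
typedef struct PoolWorkerSlot
{
    char    dbname[NAMEDATALEN];   /* database this worker should attach to */
    bool    in_use;
    /* ... whatever else the worker needs ... */
} PoolWorkerSlot;

#define MAX_POOL_WORKERS 64

/* Set up by the shmem startup hook, so it is valid in every process,
 * including EXEC_BACKEND children that don't inherit memory via fork(). */
static PoolWorkerSlot *WorkerSlots;

void pool_worker_main(Datum main_arg);

/* Registration side passes only the slot index:
 *     worker.bgw_main_arg = Int32GetDatum(slot_index);
 */

void
pool_worker_main(Datum main_arg)
{
    PoolWorkerSlot *slot = &WorkerSlots[DatumGetInt32(main_arg)];

    BackgroundWorkerUnblockSignals();
    BackgroundWorkerInitializeConnection(slot->dbname, NULL);

    /* ... main apply loop ... */
}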

Read the BDR code to see how this can work; see bdr_perdb.c, bdr_apply.c,
etc's bgworker main loops, bdr_supervisor.c and bdr_perdb.c's code for
registering dynamic bgworkers, and the _PG_init function's setup of the
static supervisor bgworker.

In your case I think you should probably be using dynamic bgworkers for
your pool anyway, so you can grow and shrink them as-needed.

> But it seems to be some design flaw in BGW, isn't it?
>

I don't think so. You're registering the worker, saying "when you're ready
please start this". You're not starting it.

You can use dynamic bgworkers too. Same deal, you register them and the
postmaster starts them in a little while, but you can register them after
_PG_init.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-03 Thread Petr Jelinek

On 2015-12-03 14:32, Craig Ringer wrote:

On 3 December 2015 at 15:27, konstantin knizhnik wrote:

3. What is the right way of creating a background worker requiring
access to shared memory, i.e. having its control structure in main memory?


This is documented and well established.

As far as I understand, background workers have to be registered
either in _PG_init or outside the postmaster environment.
If an extension requires access to shared memory, then it should be
registered in the shared_preload_libraries list and should be
initialized using a shmem_startup hook.


Correct.

You can use dynamic shmem instead, but there are some issues there IIRC.
Petr may have more to say there.
Take a look at the BDR code for some examples, and there are some in
contrib too I think.



If you have your own flock of dynamic workers that you manage yourself, 
it's probably easier to use dynamic shared memory. You can see some 
examples in the tests and also in the parallel query code for how to do 
it. The only real issue we faced with using dynamic shared memory was 
that we needed to do IPC from normal backends and that gets complicated 
when you don't have the worker info in the normal shmem.



The registration timing and working with normal shmem are actually not a 
problem. Just register the shmem startup hook in _PG_init and, if you are 
registering any bgworkers there as well, make sure you set bgw_start_time 
correctly (usually what you want is BgWorkerStart_RecoveryFinished). 
Then you'll have the shmem hook called before the bgworker is actually 
started.
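A rough sketch of that sequence (the extension name and the control struct are
made up):

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

typedef struct MyPoolState          /* hypothetical control structure */
{
    int     nworkers;
} MyPoolState;

static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
static MyPoolState *my_pool_state = NULL;

void my_pool_launcher_main(Datum main_arg);
void _PG_init(void);

static void
my_pool_shmem_startup(void)
{
    bool    found;

    if (prev_shmem_startup_hook)
        prev_shmem_startup_hook();

    /* Shared memory exists now; bgworkers have not been started yet. */
    my_pool_state = ShmemInitStruct("my_pool", sizeof(MyPoolState), &found);
    if (!found)
        memset(my_pool_state, 0, sizeof(MyPoolState));
}

/* Static worker's entry point; it can later launch dynamic workers. */
void
my_pool_launcher_main(Datum main_arg)
{
    BackgroundWorkerUnblockSignals();
    /* ... my_pool_state is already initialized at this point ... */
    proc_exit(0);
}

void
_PG_init(void)
{
    BackgroundWorker worker;

    if (!process_shared_preload_libraries_in_progress)
        return;

    RequestAddinShmemSpace(MAXALIGN(sizeof(MyPoolState)));

    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = my_pool_shmem_startup;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;    /* after shmem init */
    worker.bgw_restart_time = 5;
    snprintf(worker.bgw_name, BGW_MAXLEN, "my_pool launcher");
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_pool");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_pool_launcher_main");
    RegisterBackgroundWorker(&worker);
}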


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread Robert Haas
On Mon, Nov 30, 2015 at 11:20 AM, Konstantin Knizhnik wrote:
> We have implemented ACID multimaster based on logical replication and our
> DTM (distributed transaction manager) plugin.
> Good news is that it works and no inconsistency is detected.
> But unfortunately it is very very slow...
>
> At standalone PostgreSQL I am able to achieve about 30k TPS with 10
> clients performing simple debit-credit transactions.
> And with multimaster consisting of three nodes spawned at the same system I
> got about 100 (one hundred) TPS.
> There are two main reasons of such awful performance:
>
> 1. Logical replication serializes all transactions:  there is single
> connection between wal-sender and receiver BGW.
> 2. 2PC synchronizes transaction commit at all nodes.
>
> None of these two reasons are show stoppers themselves.
> If we remove DTM and do asynchronous logical replication then performance of
> multimaster is increased to 6000 TPS
> (please notice that in this test all multimaster node are spawned at the
> same system, sharing its resources,
> so 6k is not bad result comparing with 30k at standalone system).
> And according to 2ndquadrant results, BDR performance is very close to hot
> standby.

Logical decoding only begins decoding a transaction once the
transaction is complete.  So I would guess that the sequence of
operations here is something like this - correct me if I'm wrong:

1. Do the transaction.
2. PREPARE.
3. Replay the transaction.
4. PREPARE the replay.
5. COMMIT PREPARED on original machine.
6. COMMIT PREPARED on replica.

Step 3 introduces latency proportional to the amount of work the
transaction did, which could be a lot.   If you were doing synchronous
physical replication, the replay of the COMMIT record would only need
to wait for the replay of the commit record itself.  But with
synchronous logical replication, you've got to wait for the replay of
the entire transaction.  That's a major bummer, especially if replay
is single-threaded and there are a large number of backends generating
transactions.  Of course, the 2PC dance itself can also add latency -
that's most likely to be the issue if the transactions are each very
short.

What I'd suggest is trying to measure where the latency is coming
from.  You should be able to measure how much time each transaction
spends (a) executing, (b) preparing itself, (c) waiting for the replay
thread to begin replaying it, (d) waiting for the replay thread to
finish replaying it, and (e) committing.  Separating (c) and (d) might
be a little bit tricky, but I bet it's worth putting some effort in,
because the answer is probably important to understanding what sort of
change will help here.  If (c) is the problem, you might be able to
get around it by having multiple processes, though that only helps if
applying is slower than decoding.  But if (d) is the problem, then the
only solution is probably to begin applying the transaction
speculatively before it's prepared/committed.  I think.
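For what it's worth, the per-phase timings could be gathered with the instr_time
macros - a sketch only, with the bookkeeping struct left as a hypothetical:

#include "portability/instr_time.h"

/* Return milliseconds elapsed since 'start'. */
static double
phase_elapsed_ms(instr_time start)
{
    instr_time  now;

    INSTR_TIME_SET_CURRENT(now);
    INSTR_TIME_SUBTRACT(now, start);
    return INSTR_TIME_GET_MILLISEC(now);
}

/*
 * Around each of the phases (a)-(e):
 *
 *     instr_time  t;
 *
 *     INSTR_TIME_SET_CURRENT(t);
 *     ... run the phase ...
 *     stats->exec_ms += phase_elapsed_ms(t);    // hypothetical stats struct
 *
 * and dump the accumulated numbers with elog(LOG, ...) periodically.
 */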

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread Konstantin Knizhnik

Thank you for reply.

On 12/02/2015 08:30 PM, Robert Haas wrote:


Logical decoding only begins decoding a transaction once the
transaction is complete.  So I would guess that the sequence of
operations here is something like this - correct me if I'm wrong:

1. Do the transaction.
2. PREPARE.
3. Replay the transaction.
4. PREPARE the replay.
5. COMMIT PREPARED on original machine.
6. COMMIT PREPARED on replica.


Logical decoding is started after execution of the XLogFlush method.
So actually the transaction is not yet completed at this moment:
- it is not marked as committed in clog
- it is marked as in-progress in procarray
- locks are not released

We are not using PostgreSQL two-phase commit here.
Instead, our DTM catches control in TransactionIdCommitTree and sends a request
to the arbiter, which in turn waits for the status of committing transactions on replicas.
The problem is that transactions are delivered to the replica through a single
channel: the logical replication slot.
And while such a transaction is waiting for acknowledgement from the arbiter, it is
blocking the replication channel, preventing other (parallel) transactions from
being replicated and applied.

I have implemented a pool of background workers. Maybe it will be useful not
only for me.
It consists of a single-producer/multiple-consumers queue implemented using a
buffer in shared memory, a spinlock and two semaphores.
The API is very simple:

/* Callback a worker runs for each dequeued work item. */
typedef void (*BgwPoolExecutor)(int id, void *work, size_t size);
/* Returns the pool object for a newly started worker. */
typedef BgwPool *(*BgwPoolConstructor)(void);

extern void BgwPoolStart(int nWorkers, BgwPoolConstructor constructor);
extern void BgwPoolInit(BgwPool *pool, BgwPoolExecutor executor,
                        char const *dbname, size_t queueSize);
/* Enqueue a chunk of bytes; the first free worker dequeues and executes it. */
extern void BgwPoolExecute(BgwPool *pool, void *work, size_t size);

You just place some chunk of bytes (work, size) in this queue, and then the first
available worker will dequeue and execute it.
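For illustration, usage built only on the declarations above could look like this
(where the pool object lives and who calls BgwPoolInit are my guesses, not
necessarily how the actual code is structured):

/* Assumes the BgwPool declarations above; names below are invented. */
static BgwPool *replication_pool;       /* e.g. allocated in shared memory */

/* Runs in whichever worker dequeues the item. */
static void
apply_replicated_xact(int worker_id, void *work, size_t size)
{
    /* decode the byte chunk and apply the replicated transaction */
}

static BgwPool *
pool_constructor(void)
{
    BgwPoolInit(replication_pool, apply_replicated_xact, "mydb", 4 * 1024 * 1024);
    return replication_pool;
}

/* Receiver side: spawn the workers once, then enqueue instead of applying inline. */
static void
receiver_startup(void)
{
    BgwPoolStart(16, pool_constructor);
}

static void
receiver_handle_xact(void *xact_data, size_t xact_size)
{
    BgwPoolExecute(replication_pool, xact_data, xact_size);
}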

Using this pool and a larger number of accounts (reducing the possibility of
conflicts), I get better results.
So now the receiver of logical replication is not executing transactions directly;
instead the receiver places them in the queue and they are executed
concurrently by the pool of background workers.

At a cluster with three nodes, the results of our debit-credit benchmark are the
following:

Configuration                      TPS
Multimaster (ACID transactions)  12500
Multimaster (async replication)  34800
Standalone PostgreSQL            44000



We tested two modes: one where the client randomly distributes queries between
cluster nodes, and one where the client works with only one master node and the
others are just used as replicas. Performance is slightly better in the second
case, but the difference is not very large (about 11000 TPS in the first case).


The number of workers in the pool has a significant impact on performance: with 8
workers we get about 7800 TPS, and with 16 workers - 12500.
Performance also greatly depends on the number of accounts (and so the probability
of lock conflicts). In the case of 100 accounts the speed is less than 1000 TPS.



Step 3 introduces latency proportional to the amount of work the
transaction did, which could be a lot.   If you were doing synchronous
physical replication, the replay of the COMMIT record would only need
to wait for the replay of the commit record itself.  But with
synchronous logical replication, you've got to wait for the replay of
the entire transaction.  That's a major bummer, especially if replay
is single-threaded and there are a large number of backends generating
transactions.  Of course, the 2PC dance itself can also add latency -
that's most likely to be the issue if the transactions are each very
short.

What I'd suggest is trying to measure where the latency is coming
from.  You should be able to measure how much time each transaction
spends (a) executing, (b) preparing itself, (c) waiting for the replay
thread to begin replaying it, (d) waiting for the replay thread to
finish replaying it, and (e) committing.  Separating (c) and (d) might
be a little bit tricky, but I bet it's worth putting some effort in,
because the answer is probably important to understanding what sort of
change will help here.  If (c) is the problem, you might be able to
get around it by having multiple processes, though that only helps if
applying is slower than decoding.  But if (d) is the problem, then the
only solution is probably to begin applying the transaction
speculatively before it's prepared/committed.  I think.





Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread Craig Ringer
On 1 December 2015 at 00:20, Konstantin Knizhnik wrote:

> We have implemented ACID multimaster based on logical replication and our
> DTM (distributed transaction manager) plugin.

What are you using for an output plugin and for replay?

I'd really like to collaborate using pglogical_output if at all possible.
Petr's working really hard to get the pglogical downstream out too, with me
helping where I can.

I'd hate to be wasting time and effort working in parallel on overlapping
functionality. I did a LOT of work to make pglogical_output extensible and
reusable for different needs, with hooks used heavily instead of making
things specific to the pglogical downstream. A protocol documented in
detail. A json output mode as an option. Parameters for clients to
negotiate options. etc.

Would a different name for the upstream output plugin help?


> And according to 2ndquadrant results, BDR performance is very close to hot
> standby.
>

Yes... but it's asynchronous multi-master. Very different to what you're
doing.


> I wonder if it is a principal limitation of the logical replication approach
> which is efficient only for asynchronous replication or it can be somehow
> tuned/extended to efficiently support synchronous replication?
>

I'm certain there are improvements to be made for synchronous replication.

We have also considered alternative approaches:
> 1. Statement based replication.
>

Just don't go there. Really.


> It seems to be better to have one connection between nodes, but provide
> parallel execution of received transactions at destination side.


I agree. This is something I'd like to be able to do through logical
decoding. As far as I can tell there's no fundamental barrier to doing so,
though there are a few limitations when streaming logical xacts:

- We can't avoid sending transactions that get rolled back

- We can't send the commit timestamp, commit LSN, etc at BEGIN time, so
  last-update-wins conflict resolution can't be done based on commit timestamp

- When streaming, the xid must be in each message, not just in begin/commit.

- The apply process can't use the SPI to apply changes directly since we
  can't multiplex transactions. It'll need to use shmem to communicate with a
  pool of workers, dispatching messages to workers as they arrive. Or it can
  multiplex a set of libpq connections in async mode, which I suspect may
  prove to be better.

I've made provision for streaming support in the pglogical_output
extension. It'll need core changes to allow logical decoding to stream
changes though.

Separately, I'd also like to look at decoding and sending sequence
advances, which are something that happens outside transaction boundaries.



> We now have in PostgreSQL some infrastructure for background workers, but
> there is still no abstraction of a worker pool and job queue which could
> provide a simple way to organize parallel execution of some jobs. I wonder if
> somebody is working on it now or whether we should try to propose our solution?
>

I think a worker pool would be quite useful to have.

For BDR and for pglogical we had to build an infrastructure on top of
static and dynamic bgworkers. A static worker launches a dynamic bgworker
for each database. The dynamic bgworker for the database looks at
extension-provided user catalogs to determine whether it should launch more
dynamic bgworkers for each connection to a peer node.

Because the bgworker argument is a single by-value Datum, the argument
passed is an index into a static shmem array of structs. The struct is
populated with the target database oid (or name, for 9.4, due to bgworker
API limitations) and other info needed to start the worker.

Because registered static and dynamic bgworkers get restarted by the
postmaster after a crash/restart cycle, and the restarted static worker
will register new dynamic workers after restart, we have to jump through
some annoying hoops to avoid duplicate bgworkers. A generation counter is
stored in postmaster memory and incremented on crash recovery then copied
to shmem. The high bits of the Datum argument to the workers embed the
generation counter. They compare their argument's counter to the one in
shmem and exit if the counter differs, so the relaunched old generation of
workers exits after a crash/restart cycle. See the thread on
BGW_NO_RESTART_ON_CRASH for details.
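Compressed into a sketch, with invented names and constants, the trick looks like
this:

/* Slot index in the low bits of the bgworker argument, generation counter
 * in the high bits; a relaunched worker from before the crash exits early. */
#define WORKER_GEN_SHIFT   16
#define WORKER_SLOT_MASK   0xffff

/* Registration side (inside whatever registers the worker):
 *     worker.bgw_main_arg =
 *         Int32GetDatum((shmem->generation << WORKER_GEN_SHIFT) | slot);
 */

void
generation_checked_worker_main(Datum main_arg)
{
    int     arg = DatumGetInt32(main_arg);
    int     generation = arg >> WORKER_GEN_SHIFT;
    int     slot = arg & WORKER_SLOT_MASK;

    if (generation != shmem->generation)    /* hypothetical shmem state */
        proc_exit(0);                       /* stale worker, registered before the restart */

    /* ... normal start-up using 'slot' ... */
}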

In pglogical we're instead using BGW_NEVER_RESTART workers and doing
restarts ourselves when needed, ignoring the postmaster's ability to
restart bgworkers when the worker crashes.

It's likely that most projects using bgworkers for this sort of thing will
need similar functionality, so generalizing it into a worker pool API makes
a lot of sense. In the process we could really use API to examine currently
registered and running bgworkers. Interested in collaborating on that?

Another thing I've wanted as part of this work is a way to get a one-time
authentication cookie from the server that can be passed as a 

Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread Craig Ringer
On 3 December 2015 at 04:18, Konstantin Knizhnik wrote:


> The problem is that transactions are delivered to replica through single
> channel: logical replication slot.
> And while such transaction is waiting acknowledgement from arbiter, it is
> blocking replication channel preventing other (parallel transactions)  from
> being replicated and applied.
>

Streaming interleaved xacts from the slot as discussed in the prior mail
would help there.

You'd continue to apply concurrent work from other xacts, and just handle
commit messages as they arrive, sending the confirmations back through the
DTM API.


> I have implemented pool of background workers. May be it will be useful
> not only for me.
>

Excellent.

It should be possible to make that a separate extension. You can use C
functions from other extensions by exposing a single pg_proc function with
'internal' return type that populates a struct of function pointers for the
API. A single DirectFunctionCall lets you get the API struct. That's how
pglogical_output handles hooks. The main downside is that you can't do that
without a connection to a database with the extension installed so the
pg_proc entry is exposed.
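For illustration, the shape of that pattern (names are invented; the SQL-level
declaration would be roughly CREATE FUNCTION bgwpool_setup_api(internal) RETURNS
void, and I'm using an OidFunctionCall1 lookup here rather than a literal
DirectFunctionCall to keep the sketch in one step):

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"
#include "nodes/pg_list.h"
#include "nodes/value.h"
#include "parser/parse_func.h"
#include "bgwpool.h"                /* hypothetical header with the pool API */

/* Struct of function pointers handed out to other extensions. */
typedef struct BgwPoolApi
{
    void    (*start)(int nWorkers, BgwPoolConstructor constructor);
    void    (*execute)(BgwPool *pool, void *work, size_t size);
} BgwPoolApi;

/* Provider side: the pool extension exposes this via a pg_proc entry. */
PG_FUNCTION_INFO_V1(bgwpool_setup_api);

Datum
bgwpool_setup_api(PG_FUNCTION_ARGS)
{
    BgwPoolApi *api = (BgwPoolApi *) PG_GETARG_POINTER(0);

    api->start = BgwPoolStart;
    api->execute = BgwPoolExecute;
    PG_RETURN_VOID();
}

/* Caller side: needs a connection to a database where the extension
 * (and therefore the pg_proc entry) is installed. */
static void
load_pool_api(BgwPoolApi *api)
{
    Oid     argtypes[1] = {INTERNALOID};
    Oid     fnoid = LookupFuncName(list_make1(makeString("bgwpool_setup_api")),
                                   1, argtypes, false);

    OidFunctionCall1(fnoid, PointerGetDatum(api));
}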

So it could make more sense to just keep it as a separate .c / .h file
that's copied into trees that use it. Simpler and easier, but uglier.


> It consists of one produces-multiple consumers queue implemented using
> buffer in shared memory, spinlock and two semaphores.
>
[snip]

> You just place in this queue some bulk of bytes (work, size), it is placed
> in queue and then first available worker will dequeue it and execute.
>

Very nice.

To handle xact streaming I think you're likely to need a worker dispatch
key too, where the dispatch keys are "sticky" to a given worker. So you
assign xid 1231 to a worker at BEGIN. Send all work to the pool and
everything with xid 1231 goes to that worker. At commit you clear the
assignment of xid 1231.

Alternately a variant of the Execute method that lets you dispatch to a
specific worker would do the job.
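A sketch of that second variant (BgwPoolExecuteOn and n_pool_workers are invented
names for a per-worker Execute and the pool size):

/* Keep each in-flight xid pinned to one worker so its changes apply in
 * order, while different xids still proceed in parallel. */
static int
worker_for_xid(TransactionId xid, int nworkers)
{
    /* Any stable mapping works; assigning at BEGIN via a small hash table
     * would balance load better than a plain modulo. */
    return xid % nworkers;
}

static void
dispatch_change(TransactionId xid, void *msg, size_t size)
{
    int     target = worker_for_xid(xid, n_pool_workers);      /* hypothetical */

    BgwPoolExecuteOn(replication_pool, target, msg, size);     /* hypothetical */
}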

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread konstantin knizhnik

On Dec 3, 2015, at 4:09 AM, Craig Ringer wrote:

> On 1 December 2015 at 00:20, Konstantin Knizhnik wrote:
> 
> > We have implemented ACID multimaster based on logical replication and our 
> > DTM (distributed transaction manager) plugin.
> 
> What are you using for an output plugin and for replay?

I have implemented an output plugin for multimaster based on Michael's
decoder_raw+receiver_raw.
Right now it decodes WAL into the corresponding SQL insert/update statements.
Certainly it is a very inefficient way, and in the future I will replace it with
some binary protocol, as is used for example in BDR
(but the BDR plugin contains a lot of stuff related to detecting and handling
conflicts which is not relevant for multimaster).
But right now the performance of Multimaster is not limited by the logical
replication protocol - if I remove DTM and use asynchronous replication (a
lightweight version of BDR :)
then I get 38k TPS instead of 12k.


> 
> I'd really like to collaborate using pglogical_output if at all possible. 
> Petr's working really hard to get the pglogical downstream out too, with me 
> helping where I can.
> 
> I'd hate to be wasting time and effort working in parallel on overlapping 
> functionality. I did a LOT of work to make pglogical_output extensible and 
> reusable for different needs, with hooks used heavily instead of making 
> things specific to the pglogical downstream. A protocol documented in detail. 
> A json output mode as an option. Parameters for clients to negotiate options. 
> etc.
> 
> Would a different name for the upstream output plugin help?


And where can I get the pglogical_output plugin? Sorry, but I can't quickly find
a reference with Google...
Also I wonder if this plugin performs DDL replication (most likely not). But
then a naive question - why was DDL excluded from the logical replication protocol?
Are there some principal problems with it? In BDR it was handled in an alternative
way, using an executor callback. It would be much easier if DDL could be replicated
in the same way as normal SQL statements.


>  
> And according to 2ndquadrant results, BDR performance is very close to hot 
> standby.
> 
> Yes... but it's asynchronous multi-master. Very different to what you're 
> doing.
>  
> I wonder if it is a principal limitation of the logical replication approach which 
> is efficient only for asynchronous replication or it can be somehow 
> tuned/extended to efficiently support synchronous replication?
> 
> I'm certain there are improvements to be made for synchronous replication.
> 
> We have also considered alternative approaches:
> 1. Statement based replication.
> 
> Just don't go there. Really.
>  
> It seems to be better to have one connection between nodes, but provide 
> parallel execution of received transactions at destination side.
> 
> I agree. This is something I'd like to be able to do through logical 
> decoding. As far as I can tell there's no fundamental barrier to doing so, 
> though there are a few limitations when streaming logical xacts:
> 
> - We can't avoid sending transactions that get rolled back
> 
> - We can't send the commit timestamp, commit LSN, etc at BEGIN time, so 
> last-update-wins
>   conflict resolution can't be done based on commit timestamp
> 
> - When streaming, the xid must be in each message, not just in begin/commit.
> 
> - The apply process can't use the SPI to apply changes directly since we 
> can't multiplex transactions. It'll need to use
>   shmem to communicate with a pool of workers, dispatching messages to 
> workers as they arrive. Or it can multiplex
>   a set of libpq connections in async mode, which I suspect may prove to be 
> better.
> 
> I've made provision for streaming support in the pglogical_output extension. 
> It'll need core changes to allow logical decoding to stream changes though.
> 
> Separately, I'd also like to look at decoding and sending sequence advances, 
> which are something that happens outside transaction boundaries.
> 
>  
> We have now in PostgreSQL some infrastructure for background works, but there 
> is still no abstraction of workers pool and job queue which can provide 
> simple way to organize parallel execution of some jobs. I wonder if somebody 
> is working now on it or we should try to propose our solution?
> 
> I think a worker pool would be quite useful to have.
> 
> For BDR and for pglogical we had to build an infrastructure on top of static 
> and dynamic bgworkers. A static worker launches a dynamic bgworker for each 
> database. The dynamic bgworker for the database looks at extension-provided 
> user catalogs to determine whether it should launch more dynamic bgworkers 
> for each connection to a peer node.
> 
> Because the bgworker argument is a single by-value Datum the argument passed 
> is an index into a static shmem array of structs. The struct is populated 
> with the target database oid (or name, for 9.4, due to bgworker API 
> limitations) and other info needed to 

Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread Craig Ringer
On 3 December 2015 at 14:54, konstantin knizhnik wrote:

>
> I'd really like to collaborate using pglogical_output if at all possible.
> Petr's working really hard to get the pglogical downstrem out too, with me
> helping where I can.
>
> And where I can get  pglogical_output plugin? Sorry, but I can't quickly
> find reference with Google...
>

It's been submitted to this CF.

https://commitfest.postgresql.org/7/418/

https://github.com/2ndQuadrant/postgres/tree/dev/pglogical-output

Any tests and comments would be greatly appreciated.

I have a version compatible with 9.4 and older in a separate tree I want to
make public. I'll get back to you on that later today. It's the same code
with a few more ifdefs and an uglier structure for the example hooks module
(because it can be a separate contrib), so it's not that exciting.

You should be able to just "git remote add" that repo, "git fetch" and "git
merge dev/pglogical-output" into your working tree.


> Also I wonder if this plugin perform DDL replication (most likely not).
>

No, it doesn't. The way it's done in BDR is too intrusive and has to be
reworked before it can be made more generally re-usable.

How I envision DDL replication working for pglogical (or anything else) is
to take the DDL hooks added in 9.5 and use them with a separate DDL deparse
extension based on Álvaro's deparse work. If you want to replicate DDL you
make sure this extension is loaded then use it from your event triggers to
capture DDL in a useful form and write it to a queue table where your
downstream client can find it and consume it. That way the deparse code
doesn't have to be embedded in the Pg backend like it is in BDR, and
instead can be a reusable extension.

> But then a naive question - why was DDL excluded from the logical replication
> protocol?
>

logical decoding can't handle DDL because all it sees is the effects of
that DDL in the xlog as a series of changes to catalog tables, relfilenode
changes, etc. It can't turn that back into the original DDL in any kind of
reliable way. A downstream can't do very much with "rename relfilenode 1231
to 1241".

There are a few cases we might want to handle through decoding - in
particular I'd like to be able to decode changes to rows in shared catalogs
like pg_authid, since we can't handle that with DDL deparse. For things
like DROP TABLE, CREATE TABLE, etc we really need DDL hooks. At least as I
currently understand things.

So we try to capture DDL at a higher level. That's why event triggers were
added (http://www.postgresql.org/docs/current/static/event-triggers.html)
and why DDL deparse was implemented (
https://commitfest-old.postgresql.org/action/patch_view?id=1610).

You can't just capture the raw DDL statement since there are issues with
search_path normalization, etc. Similar problems to statement based
replication exist. Deparse is required to get the DDL after it's converted
to a utility statement so we can obtain it in an unambiguous form.

I'll add some explanation in pglogical_output's DESIGN.md for why DDL is
not currently handled.

BTW, TRUNCATE _is_ handled by the way. In pglogical we use regular TRUNCATE
triggers (marked tgisinternal) for that. There are some significant
complexities around foreign keys, sequence reset, etc, which are not fully
handled yet.

> Are there some principal problems with it? In BDR it was handled in
> alternative way, using executor callback. It will be much easier if DDL can
> be replicated in the same way as normal SQL statements.
>

It can't. I wish it could.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread konstantin knizhnik

On Dec 3, 2015, at 4:18 AM, Craig Ringer wrote:

> Excellent.
> 
> It should be possible to make that a separate extension. You can use C 
> functions from other extensions by exposing a single pg_proc function with 
> 'internal' return type that populates a struct of function pointers for the 
> API. A single DirectFunctionCall lets you get the API struct. That's how 
> pglogical_output handles hooks. The main downside is that you can't do that 
> without a connection to a database with the extension installed so the 
> pg_proc entry is exposed.


Actually, while working on the cluster and columnar storage extensions I got
several questions about PostgreSQL infrastructure.
I always found some workarounds, but maybe it is better to ask the community about
it :)

1. Why is there no "conditional event" synchronization primitive in PostgreSQL?
There is a latch, but it is implemented using sockets and I am afraid that it is
not very fast.
It would be nice to have some fast primitive like pthread condition variables.

2. PostgreSQL semaphores seem not to be intended for external use outside
the PostgreSQL core (for example in extensions).
There is no way to request an additional number of semaphores. Right now
semaphores are allocated based on the maximal number of backends and spinlocks.
And a semaphore, as well as an event, is a very popular and convenient
synchronization primitive required in many cases.

3. What is the right way of creating a background worker requiring access to
shared memory, i.e. having its control structure in main memory?
As far as I understand, background workers have to be registered either in _PG_init
or outside the postmaster environment.
If an extension requires access to shared memory, then it should be registered in
the shared_preload_libraries list and should be initialized using a shmem_startup
hook.
Something like this:

void _PG_init(void)
{
    if (!process_shared_preload_libraries_in_progress)
        return;
    ...
    prev_shmem_startup_hook = shmem_startup_hook;
    shmem_startup_hook = My_shmem_startup;
}

My_shmem_startup is needed because in _PG_init it is not possible to allocate
shared memory.
So if I need to allocate some control structure for background workers in
shared memory, then I should do it in My_shmem_startup.
But I cannot register background workers in My_shmem_startup! I will get a "must
be registered in shared_preload_libraries" error:

void
RegisterBackgroundWorker(BackgroundWorker *worker)
{
    if (!process_shared_preload_libraries_in_progress)
    {
        if (!IsUnderPostmaster)
            ereport(LOG,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("background worker \"%s\": must be registered in shared_preload_libraries",
                            worker->bgw_name)));
        return;
    }
}

So I have to register background workers in _PG_init while the control structure
for them is not yet ready.
When I implemented the pool of background workers, I solved this problem by
providing a function which returns the address of the control structure later -
when it has actually been allocated.
But it seems to be some design flaw in BGW, isn't it?







Re: [HACKERS] Logical replication and multimaster

2015-12-02 Thread Craig Ringer
On 3 December 2015 at 01:30, Robert Haas  wrote:


> 1. Do the transaction.
> 2. PREPARE.
> 3. Replay the transaction.
>

As Konstantin noted they aren't using Pg's 2PC. They actually couldn't if
they wanted to because logical decoding does not support decoding an xact
at PREPARE TRANSACTION time, without COMMIT PREPARED.

I'd love to change that and allow decoding at PREPARE TRANSACTION time - or
streaming the xact from the start, as discussed in the prior mail. This
would be a huge help for doing consensus operations on an otherwise
asynchronous cluster, like making table structure changes. You'd decode the
prepared xact, replay it, prepare it on all nodes, then commit prepared
when all nodes confirm successful prepare.

IIRC the main issue with this is that the prepared xact continues to hold
locks so logical decoding can't acquire the locks it needs to decode the
xact.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Logical replication and multimaster

2015-11-30 Thread Konstantin Knizhnik

Hello all,

We have implemented ACID multimaster based on logical replication and 
our DTM (distributed transaction manager) plugin.

Good news is that it works and no inconsistency is detected.
But unfortunately it is very very slow...

At standalone PostgreSQL I am able to achieve about 30k TPS with 10
clients performing simple debit-credit transactions.
And with a multimaster consisting of three nodes spawned at the same
system I got about 100 (one hundred) TPS.

There are two main reasons for such awful performance:

1. Logical replication serializes all transactions: there is a single
connection between the wal-sender and the receiver BGW.

2. 2PC synchronizes transaction commit at all nodes.

Neither of these two reasons is a show stopper by itself.
If we remove DTM and do asynchronous logical replication then the
performance of multimaster increases to 6000 TPS
(please notice that in this test all multimaster nodes are spawned at the
same system, sharing its resources,
so 6k is not a bad result compared with 30k at a standalone system).
And according to 2ndQuadrant's results, BDR performance is very close to
hot standby.


On the other hand, our previous experiments with DTM show only about a 2x
slowdown compared with vanilla PostgreSQL.

But the result of combining DTM and logical replication is frustrating.

I wonder if it is a principal limitation of the logical replication approach,
which is efficient only for asynchronous replication, or whether it can be
somehow tuned/extended to efficiently support synchronous replication?


We have also considered alternative approaches:
1. Statement based replication.
2. Trigger-based replication.
3. Replication using custom nodes.

In the case of statement-based replication it is hard to guarantee identity
of data at different nodes.
Approaches 2 and 3 are much harder to implement and require
"reinventing" a substantial part of logical replication.
They also require some kind of connection pool which can be used to send
replicated transactions to the peer nodes (to avoid serialization of
parallel transactions, as in the case of logical replication).


But it looks like there is not much sense in having multiple network
connections between one pair of nodes.
It seems to be better to have one connection between nodes, but provide
parallel execution of received transactions at the destination side. But that
also seems to be nontrivial. We now have in PostgreSQL some
infrastructure for background workers, but there is still no abstraction
of a worker pool and job queue which could provide a simple way to organize
parallel execution of some jobs. I wonder if somebody is working on
it now or whether we should try to propose our solution?


Best regards,
Konstantin



