Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Andres Freund
On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:
 Boszormenyi Zoltan z...@cybertec.at wrote:
  we tried to discuss on a lower level what should be needed
  for a partial replication based on streaming replication.
 We need to discuss a partial recovery before the partial replication.
If you do the filtering on the sending side you don't actually need partial
recovery, in the sense that you filter in the rmgr or similar.

Or do I miss something?

Andres



Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Itagaki Takahiro

Andres Freund and...@anarazel.de wrote:

  We need to discuss a partial recovery before the partial replication.
 If you do the filtering on the sending side you don't actually need partial
 recovery, in the sense that you filter in the rmgr or similar.
 
 Or do I miss something?

Sorry, I didn't explain that well.

I just suggested an order of development. I think partial recovery
is easier than partial replication because we don't need to think
about networking or multiple clients during recovery.

Also, this feature could be developed on top of Streaming Replication, but
that is still under development -- the code is not settled. So, I'd suggest
starting the development with the parts that are independent of Streaming
Replication.

I believe the partial replication will eventually require the modules
developed for the partial recovery.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Stefan Kaltenbrunner

Andres Freund wrote:

On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:

Boszormenyi Zoltan z...@cybertec.at wrote:

we tried to discuss on a lower level what should be needed
for a partial replication based on streaming replication.

We need to discuss a partial recovery before the partial replication.
If you do the filtering on the sending side you don't actually need partial
recovery, in the sense that you filter in the rmgr or similar.


Or do I miss something?


the question is whether filtering on the sending side is actually the right
thing to do.
It increases the overhead and the complexity on the master, especially
if you think about different (partial) replication agreements for
different slaves, and it might also be hard to integrate with the planned
sync/async modes.
On the other hand, if you filter on the master you might be able to avoid
a lot of network traffic due to filtered WAL records.
I think for a first step it might make more sense to look into doing the
filtering on the receiving side and look into actual integration with SR
at a later stage.



Stefan



Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Andres Freund
On Monday 30 November 2009 10:32:50 Stefan Kaltenbrunner wrote:
 Andres Freund wrote:
  On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:
  Boszormenyi Zoltan z...@cybertec.at wrote:
  we tried to discuss on a lower level what should be needed
  for a partial replication based on streaming replication.
 
  We need to discuss a partial recovery before the partial replication.
 
  If you do the filtering on the sending side you don't actually need
  partial recovery, in the sense that you filter in the rmgr or similar.
 
  Or do I miss something?
 
 the question is whether filtering on the sending side is actually the right
 thing to do.
 It increases the overhead and the complexity on the master, especially
 if you think about different (partial) replication agreements for
 different slaves, and it might also be hard to integrate with the planned
 sync/async modes.
 On the other hand, if you filter on the master you might be able to avoid
 a lot of network traffic due to filtered WAL records.
 I think for a first step it might make more sense to look into doing the
 filtering on the receiving side and look into actual integration with SR
 at a later stage.
I think filtering on the receiving side is harder by many degrees because you
don't have an up-to-date copy of the catalog. I can't think of a design that
does not impose severe constraints on the catalog and especially on replication
settings when implemented on the receiving side.

Andres




Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Robert Haas
On Nov 30, 2009, at 1:55 AM, Craig Ringer cr...@postnewspapers.com.au wrote:

Boszormenyi Zoltan wrote:

c. splitting wal into different replication sets

Just a side note: in addition to its use for partial replication, this
might have potential for performance-prioritizing databases or tablespaces.

Being able to separate WAL logging so that different DBs, tablespaces,
etc went to different sets of WAL logs would allow a DBA to give some
databases or tablespaces dedicated WAL logging space on faster storage.
If partial recovery is implemented, it might also permit less important
databases to be logged to fast-but-unsafe storage such as a non-BBU disk
controller with write cache enabled, without putting more important
databases in the same cluster in danger.

The danger here is that if we make crash recovery more complex, we'll
introduce subtle bugs that will only be discovered after someone's
data is toast.


...Robert



Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Tom Lane
Craig Ringer cr...@postnewspapers.com.au writes:
 Just a side note: in addition to its use for partial replication, this
 might have potential for performance-prioritizing databases or tablespaces.

 Being able to separate WAL logging so that different DBs, tablespaces,
 etc went to different sets of WAL logs would allow a DBA to give some
 databases or tablespaces dedicated WAL logging space on faster storage.

I don't think this can possibly work without introducing data corruption
issues.  What happens when a transaction touches tables in different
tablespaces?  You can't apply the changes out-of-order.

regards, tom lane



Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Hans-Jürgen Schönig


On Nov 30, 2009, at 10:32 AM, Stefan Kaltenbrunner wrote:


Andres Freund wrote:

On Monday 30 November 2009 03:57:11 Itagaki Takahiro wrote:

Boszormenyi Zoltan z...@cybertec.at wrote:

we tried to discuss on a lower level what should be needed
for a partial replication based on streaming replication.
We need to discuss a partial recovery before the partial replication.
If you do the filtering on the sending side you don't actually need
partial recovery, in the sense that you filter in the rmgr or similar.

Or do I miss something?


the question is whether filtering on the sending side is actually the
right thing to do.
It increases the overhead and the complexity on the master,
especially if you think about different (partial) replication
agreements for different slaves, and it might also be hard to
integrate with the planned sync/async modes.
On the other hand, if you filter on the master you might be able to
avoid a lot of network traffic due to filtered WAL records.
I think for a first step it might make more sense to look into doing
the filtering on the receiving side and look into actual integration
with SR at a later stage.



Stefan



hello ...

one problem with not filtering on the master is that you will end up
with a lot of complexity if you start adding new tables to a replica,
because you just cannot add tables as easily as when you are doing stuff
on the slave. the procedure seems way more complex.
in addition to that you are sending WAL which has to be discarded
anyway.
we thought a lot about filtering outside the master, but to me it did
not sound like a good plan.


regards,

hans


--
Cybertec Schönig & Schönig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de




Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Hans-Jürgen Schönig




Just a side note: in addition to its use for partial replication, this
might have potential for performance-prioritizing databases or tablespaces.




hello ...

this is an absolute non-starter. the WAL is designed to be hyper-ordered
and hyper-critical. once you fuck up the order you will end up with a
total disaster. WAL has to be applied in perfect order, without skipping
dependent objects and so on. any concept which tries to get around this
fundamental law is broken.


hans


--
Cybertec Schönig & Schönig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de




Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Andres Freund
On Monday 30 November 2009 17:46:45 Hans-Jürgen Schönig wrote:
 On Nov 30, 2009, at 10:32 AM, Stefan Kaltenbrunner wrote:
  the question is whether filtering on the sending side is actually the
  right thing to do.
  It increases the overhead and the complexity on the master,
  especially if you think about different (partial) replication
  agreements for different slaves, and it might also be hard to
  integrate with the planned sync/async modes.
  On the other hand, if you filter on the master you might be able to
  avoid a lot of network traffic due to filtered WAL records.
  I think for a first step it might make more sense to look into doing
  the filtering on the receiving side and look into actual integration
  with SR at a later stage.
 one problem with not filtering on the master is that you will end up
 with a lot of complexity if you start adding new tables to a replica,
 because you just cannot add tables as easily as when you are doing stuff
 on the slave. the procedure seems way more complex.
 in addition to that you are sending WAL which has to be discarded
 anyway.
 we thought a lot about filtering outside the master, but to me it did
 not sound like a good plan.
One possibility for the far future would be to allow filtering on a slave as 
well:

master -- full replication --> primary slave -- split --> slaves

Possibly doing only catalog recovery on the primary slave. In my opinion that's
heaps more complex and not better in all situations. So I would probably write
it down as a nice idea, but nothing more.

Andres



Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-30 Thread Craig Ringer

On 30/11/2009 11:07 PM, Tom Lane wrote:

Craig Ringer cr...@postnewspapers.com.au writes:

Just a side note: in addition to its use for partial replication, this
might have potential for performance-prioritizing databases or tablespaces.



Being able to separate WAL logging so that different DBs, tablespaces,
etc went to different sets of WAL logs would allow a DBA to give some
databases or tablespaces dedicated WAL logging space on faster storage.


I don't think this can possibly work without introducing data corruption
issues.  What happens when a transaction touches tables in different
tablespaces?  You can't apply the changes out-of-order.


Argh, good point, and one that should've been blindingly obvious.

At a database level something like that may still be handy, though I 
haven't the foggiest how one would handle the shared system catalogs.


--
Craig Ringer



Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-29 Thread Boszormenyi Zoltan
Hi,

we tried to discuss on a lower level what should be needed
for a partial replication based on streaming replication.

a. transferring base data after a relation/index/... got added to a slave
(and initial setup)
b. communicating to the slaves which relations they currently should
have available
c. splitting wal into different replication sets
d. some configuration frontend, possibly directly via SQL or via functions

I don't think it's reasonable to try to discuss and implement all of this
in one huge patch, so I propose implementing at least a) as a separate
patch.

a) is very useful outside the context of this specific feature and is
kind of a prerequisite, so I suggest tackling it first.

Plan (M = master, S = slave, SM = both):
M: checkpoint, full page writes, access share lock on the relation
S: stop receiving wal
SM: Using $method transfer required base data for every required segment/fork
M: normal writes
S: restart wal replay

Do you see any fundamental problem with this?
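
A minimal standalone sketch of that sequence, just to pin down the ordering;
every function below is a hypothetical stub, not an existing PostgreSQL API:

#include <stdio.h>
#include <stdbool.h>

/* M: force a checkpoint, enable full page writes, take an
 * ACCESS SHARE lock on the relation */
static void master_prepare(const char *relation)
{
    printf("M: checkpoint, full_page_writes=on, lock %s\n", relation);
}

/* S: pause WAL receipt/replay so base files are stable during the copy */
static void slave_pause_replay(void)
{
    printf("S: stop receiving WAL\n");
}

/* SM: copy every required segment/fork using $method */
static bool transfer_base_data(const char *relation)
{
    printf("SM: transfer base data for %s\n", relation);
    return true;
}

/* M resumes normal writes, S restarts WAL replay from the pause point */
static void resume_both(void)
{
    printf("M: normal writes / S: restart WAL replay\n");
}

int main(void)
{
    const char *relation = "public.some_table";   /* hypothetical */

    master_prepare(relation);
    slave_pause_replay();
    if (!transfer_base_data(relation))
        return 1;   /* on failure, replay must not resume */
    resume_both();
    return 0;
}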

Now there are, unfortunately, two different schools of thought on how to
implement transferring base data.
1. an archive_command-like transfer command taking a source path/host and
target path/host
   - very flexible (think e.g. of not transferring the data for multiple
slaves over the whole country)
   - harder to set up
   - more in the style of classical WAL archiving
2. add the capability to the WAL Streaming patch's libpq-based protocol
   - no additional configuration needed
   - inflexible
   - makes usage from non-streaming replication impossible

I favor 1. but only lightly so.
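
For option 1, the mechanics would presumably mirror how archive_command is
handled today: substitute placeholders into a user-supplied shell template
and run it once per file. A standalone sketch, where the %p/%f placeholders
are an assumption borrowed from archive_command, not settled syntax:

#include <stdio.h>
#include <stdlib.h>

/* Expand %p (source path) and %f (file name) in the configured
 * template, then hand the result to the shell, archive_command-style. */
static int run_transfer_command(const char *template_,
                                const char *path, const char *fname)
{
    char   cmd[1024];
    size_t n = 0;

    for (const char *p = template_; *p; p++)
    {
        const char *sub = NULL;

        if (p[0] == '%' && p[1] == 'p')      sub = path;
        else if (p[0] == '%' && p[1] == 'f') sub = fname;

        if (sub)
        {
            int w = snprintf(cmd + n, sizeof(cmd) - n, "%s", sub);
            if (w < 0 || (size_t) w >= sizeof(cmd) - n)
                return -1;                 /* expanded command too long */
            n += (size_t) w;
            p++;                           /* skip the placeholder letter */
        }
        else
        {
            if (n >= sizeof(cmd) - 1)
                return -1;
            cmd[n++] = *p;
        }
    }
    cmd[n] = '\0';

    return system(cmd);                    /* 0 = this file was copied */
}

int main(void)
{
    /* e.g. transfer_command = 'scp %p slave:/base/%f' (all made up) */
    return run_transfer_command("echo scp %p slave:/base/%f",
                                "/data/base/16384/16385", "16385");
}

This is also where the flexibility argument shows: the template can be scp,
rsync, or a script fanning out to several slaves, without the server knowing.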

Detail Questions:
- How to deal with multiple transfer requests at the same time?
  There would be a need for multiple full backup requests for
  individual tables by several clients at once.
  Currently pg_start_backup() isn't allowed from
  two clients in parallel; the second one gets an error.
  We thought that pg_start_backup() and pg_stop_backup()
  could turn into simple reference counts. IIRC, WALs
  are still generated and _shipped to slaves_ during
  a full backup; they are simply not yet applied to
  base table files. So, in this case a pg_stop_backup()
  issued from a slave decreases the refcount of the base backups,
  and the slave can simply resume applying its newly
  received WALs to base files.
- Keep track of the current number of transfers
  We would also need a way to query the refcount of the
  base backup, so that if a slave dies, the master can be
  recovered manually and can resume applying leftover WALs.
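
A toy sketch of the refcount idea; PostgreSQL would keep the counter in
shared memory with proper locking, and a pthread mutex merely stands in
for that here:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t backup_lock = PTHREAD_MUTEX_INITIALIZER;
static int backup_refcount = 0;

/* First caller in triggers the switch to backup mode; later callers
 * just pile on instead of erroring out as pg_start_backup() does today. */
static void start_backup(void)
{
    pthread_mutex_lock(&backup_lock);
    if (backup_refcount++ == 0)
        printf("first backup: stop applying WAL to base files\n");
    pthread_mutex_unlock(&backup_lock);
}

/* Last caller out lets WAL application to base files resume. */
static void stop_backup(void)
{
    pthread_mutex_lock(&backup_lock);
    if (--backup_refcount == 0)
        printf("last backup done: resume applying WAL to base files\n");
    pthread_mutex_unlock(&backup_lock);
}

/* Exposing the count answers the "query the refcount" question above. */
static int backups_in_progress(void)
{
    pthread_mutex_lock(&backup_lock);
    int n = backup_refcount;
    pthread_mutex_unlock(&backup_lock);
    return n;
}

int main(void)
{
    start_backup();                       /* slave 1 */
    start_backup();                       /* slave 2: no error this time */
    printf("in progress: %d\n", backups_in_progress());
    stop_backup();
    stop_backup();
    return 0;
}

A crashed slave would leave the refcount stuck, which is exactly why the
manual-recovery path above is needed.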

b) The slave needs to know when a relation has been added to it in
order to request a base backup of the relevant files.

I would suggest adding new WAL record types for this:
  - DROP_RELATION_SLAVE(node_id, relation)
  - ADD_RELATION_SLAVE(node_id, relation)

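The payload of such a record could be as small as the target node plus the
relation's identity. A hypothetical layout; Oid and RelFileNode are
re-declared here only to keep the sketch self-contained (the real
definitions live in the PostgreSQL headers):

#include <stdio.h>
#include <stdint.h>

typedef uint32_t Oid;

typedef struct RelFileNode
{
    Oid spcNode;    /* tablespace */
    Oid dbNode;     /* database */
    Oid relNode;    /* relation */
} RelFileNode;

/* shared payload for ADD_RELATION_SLAVE / DROP_RELATION_SLAVE */
typedef struct xl_relation_slave
{
    Oid         node_id;    /* which slave the change applies to */
    RelFileNode rnode;      /* which relation is added or dropped */
} xl_relation_slave;

int main(void)
{
    xl_relation_slave rec = { 1, { 1663, 16384, 16385 } };

    printf("node %u, rel %u/%u/%u, %zu bytes\n",
           rec.node_id, rec.rnode.spcNode, rec.rnode.dbNode,
           rec.rnode.relNode, sizeof(rec));
    return 0;
}
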
We thought about two ways of administering the replication set:
- slaves with full replication, optionally and explicitly excluding
relations
- slaves with minimal replication, with explicitly included relations
The above WAL record types will be needed in both cases,
but we also need two new catalog tables:
- the slave nodes, indicating the type of the slave above
- explicit table indications (treated depending on the type of the slave)

Any out-of-band communication has severe problems with
crashes/unavailability of the slave, or with not allowing classic,
non-streaming WAL replication.

Questions:
- How to deal with the fact that the slave may be unavailable while something
is being added to its replication set?
  - Possibly forbid all DDL to the table until the slave has got the update

c)
What to filter:
Every slave gets a node_id, assigned in a system catalog on the master:
pg_node(nodeoid) (per cluster)
And a catalog contains all replicated relations:
pg_replication_set(nodeoid, classoid, acknowledged_on_slave) (per database)
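
In the FormData_*-struct style used for catalog rows in the PostgreSQL
headers, those two catalogs might look roughly like this; only the column
names written above come from the proposal, the rest is a sketch:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

/* pg_node: one row per slave node, cluster-wide */
typedef struct FormData_pg_node
{
    Oid nodeoid;                 /* node id handed to the walsender */
} FormData_pg_node;

/* pg_replication_set: one row per replicated relation, per database */
typedef struct FormData_pg_replication_set
{
    Oid  nodeoid;                /* which slave */
    Oid  classoid;               /* which relation (pg_class OID) */
    bool acknowledged_on_slave;  /* base copy completed on the slave? */
} FormData_pg_replication_set;

int main(void) { return 0; }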

How to filter:
Heap2, Heap, Btree, Gin, Gist, Storage, and Sequence records need to be
filtered by database/relation.
I am by far not yet familiar enough with the relevant code to see whether it
is feasible and worthwhile at all to filter clog, transaction, and multixact
records per database.

Where to filter:
I propose doing so in the walsender. While this would prohibit using a
classical WAL-based standby, I do not see a big problem in that.

If done via WAL streaming, it would be a simple addition of a node_id in
PQstartXLogStreaming. This id obviously should not be resettable to something
else...
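
A standalone sketch of the per-record decision the walsender would then make.
The rmgr IDs, the record shape, and the replication-set lookup are all mocked
up here; real code would go through PostgreSQL's RmgrId and XLogRecord:

#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>

typedef uint32_t Oid;

/* mock rmgr ids -- stand-ins for the real resource manager list */
typedef enum { RM_XLOG, RM_XACT, RM_CLOG, RM_HEAP, RM_HEAP2,
               RM_BTREE, RM_GIN, RM_GIST, RM_STORAGE, RM_SEQ } RmgrId;

typedef struct MockRecord
{
    RmgrId rmid;
    Oid    dbNode;     /* database of the touched relation, if any */
    Oid    relNode;    /* relation, if any */
} MockRecord;

/* stand-in for a pg_replication_set lookup for one node */
static bool in_replication_set(Oid node_id, Oid db, Oid rel)
{
    return node_id == 1 && db == 16384 && rel == 16385;   /* demo data */
}

static bool send_record(Oid node_id, const MockRecord *rec)
{
    switch (rec->rmid)
    {
        /* relation-scoped rmgrs: ship only what the node subscribed to */
        case RM_HEAP: case RM_HEAP2: case RM_BTREE: case RM_GIN:
        case RM_GIST: case RM_STORAGE: case RM_SEQ:
            return in_replication_set(node_id, rec->dbNode, rec->relNode);

        /* xlog/transaction/clog etc.: always ship (whether these can be
         * filtered per database is the open question above) */
        default:
            return true;
    }
}

int main(void)
{
    MockRecord a = { RM_HEAP,  16384, 16385 };   /* replicated table   */
    MockRecord b = { RM_BTREE, 16384, 99999 };   /* unreplicated index */
    MockRecord c = { RM_XACT,  0,     0     };   /* commit record      */

    printf("ship a=%d b=%d c=%d\n",
           send_record(1, &a), send_record(1, &b), send_record(1, &c));
    return 0;
}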

Questions:
- How to deal with access to the different database-wide catalogs?
  - Storing that data cluster-wide seems really ugly.
  - read the code...

d) I do not have any strong or even moderate opinions about this. I think it's
sensible to get something prototypish and function-based done before deciding
on the real interface and getting into syntax wars.


Steps:

1:
- Transfer of Relations during runtime
  - needed to use wal-splitting
  - internally:
- Possibility 1:
  - transfer_command = ... %filename%
  - should not error out if data changes beneath it.
  - called for every file (i.e. 

Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-29 Thread Itagaki Takahiro

Boszormenyi Zoltan z...@cybertec.at wrote:

 we tried to discuss on a lower level what should be needed
 for a partial replication based on streaming replication.

We need to discuss a partial recovery before the partial replication.

There are some related items in our ToDo list and previous discussions:

http://wiki.postgresql.org/wiki/Todo
- Allow WAL logging to be turned off for a table,
  but the table might be dropped or truncated during crash recovery
- Allow WAL logging to be turned off for a table,
  but the table would avoid being truncated/dropped

- rmgr_hook
http://archives.postgresql.org/pgsql-hackers/2008-12/msg01361.php

Could you try them first as a preparation for the partial replication?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center





Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-29 Thread Fujii Masao
On Mon, Nov 30, 2009 at 4:56 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
 I don't think it's reasonable to try to discuss and implement all of this
 in one huge patch, so I propose implementing at least a) as a separate
 patch.

I agree with this development plan.

 Now there are, unfortunately, two different schools of thought on how to
 implement transferring base data.
 1. an archive_command-like transfer command taking a source path/host and
 target path/host
    - very flexible (think e.g. of not transferring the data for multiple
 slaves over the whole country)
    - harder to set up
    - more in the style of classical WAL archiving
 2. add the capability to the WAL Streaming patch's libpq-based protocol
    - no additional configuration needed
    - inflexible
    - makes usage from non-streaming replication impossible

 I favor 1. but only lightly so.

I favor 2 ;) because I think that it's too hard for users to set up
a transfer command. One of streaming replication's merits is that
users no longer need to specify a transfer command for log-shipping,
so users can configure and use replication without complex settings.
But #1 would spoil this merit.

 Detail Questions:
 - How to deal with multiple transfer requests at the same time?
  There would be a need for multiple full backup requests for
  individual tables by several clients at once.
  Currently pg_start_backup() isn't allowed from
  two clients in parallel; the second one gets an error.
  We thought that pg_start_backup() and pg_stop_backup()
  could turn into simple reference counts. IIRC, WALs
  are still generated and _shipped to slaves_ during
  a full backup; they are simply not yet applied to
  base table files. So, in this case a pg_stop_backup()
  issued from a slave decreases the refcount of the base backups,
  and the slave can simply resume applying its newly
  received WALs to base files.

I'm not sure how. But at first, a multiple online-backup feature,
rather than backup-shipping itself, might have to be addressed.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] draft RFC: concept for partial, wal-based replication

2009-11-29 Thread Craig Ringer
Boszormenyi Zoltan wrote:

 c. splitting wal into different replication sets

Just a side note: in addition to its use for partial replication, this
might have potential for performance-prioritizing databases or tablespaces.

Being able to separate WAL logging so that different DBs, tablespaces,
etc went to different sets of WAL logs would allow a DBA to give some
databases or tablespaces dedicated WAL logging space on faster storage.
If partial recovery is implemented, it might also permit less important
databases to be logged to fast-but-unsafe storage such as a non-BBU disk
controller with write cache enabled, without putting more important
databases in the same cluster in danger.

More importantly, if the WAL writing were done by different WAL writer
backends, the admin could also use nice and ionice to encourage the OS
to favour WAL logging for some DBs over others.

Currently all these things require splitting the install into multiple
clusters, incurring config management and backup overhead and, most
importantly, partitioning shared memory.

OTOH, even with split WAL logging, you still have the shared bgwriter to
contend with, and the effects of an unimportant query pushing data
related to more performance-critical DBs out of shm or OS cache. So
perhaps splitting the cluster is actually the best answer, and a
complete implementation of DB prioritization would end up looking a lot
like multiple Pg clusters multiplexed on one port anyway...

In any case, I thought it worth mentioning as something that may be
worth keeping in mind - or considering and disregarding - while looking
at the WAL changes involved in partial replication.

--
Craig Ringer



[HACKERS] draft RFC: concept for partial, wal-based replication

2009-10-30 Thread Hans-Juergen Schoenig -- PostgreSQL

hello ...

as my day has worked out quite nicely up to now, i thought to f... it up
and post a new concept which has been requested by a customer. the goal
is to collect some feedback, ideas and so on (not to be mixed up with
flames).
we have funding for this and we are trying to sort out how to do it the
best way. comments are welcome ...

note, this is a first draft i want to refine based on some comments.
here we go ...


Partial WAL Replication for PostgreSQL:
---

As of now the PostgreSQL community has provided patches and functionalities
which allow full WAL-based replication as well as hot standby. To extend this
functionality and to make PostgreSQL even more suitable for enterprise
computing than it is today, we have the commitment of a sponsor to fund
partial replication for PostgreSQL 8.5 / 8.6.

This is the first draft of a proposal to make partial WAL-based replication
work and to provide an additional set of fancy features to the community,
which has been waiting for real in-core replication for a decade or more.


Why partial replication?


In some cases people have master servers which contain enormous amounts of
data (XX TB or so). If more than just one replica of this data is needed, it
might happen that different slaves are used for different purposes. This
implies that not all data will be used by all machines.
An example: Consider a server at a phone company collecting phone calls,
billing data, and maybe network routing data. The data is used by different
departments, and one machine is not enough to serve all three of them. With
the new functionality proposed here we could make 3 replicas, each holding
just the group of tables for its specific tasks, thus allowing people to buy
cheaper hardware for the slaves and use more machines instead.


Current status:
---

Hot standby and streaming replication have been a huge leap forward for the
community, and what is proposed here will be an extension to those patches
and functionalities. This concept is NOT aimed at replacing anything - it is
mainly an addon.


Nodes and replication filters:
--

As of 8.4, standby systems are set up by creating an archive_command along
with a base backup. Although it is easy to do, some users still reported
difficulties due to a total misunderstanding of PITR.

The idea is to add functionality for adding slaves like this:

CREATE REPLICA node_name
   CONNECT FROM SLAVE 'connect_string'
   TRANSFER COMMAND 'command'
   [ USING replication_filter ];

'command' would be any shell script copying data from the local master to the
new database node called node_name. Replication filters can be used to make X
replicas contain the same tables. Filter sets can be created like this:

CREATE REPLICATION FILTER filter_name
   [ EMPTY | FULL ] [ INCLUDE | EXCLUDE CHANGES ];

Replication filters can be modified ...

ALTER REPLICATION FILTER filter_name RENAME TO new_filtername;
ALTER REPLICATION FILTER filter_name
   { ADD | REMOVE } { TABLE | INDEX | SEQUENCE } object;

Filter sets can be dropped like this ...

DROP REPLICATION FILTER filter_name;
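
Assuming this syntax were accepted, driving it from a client could look like
the following; the statements follow the proposal above, while the connection
strings, object names, and the %p/%f placeholders are made up:

#include <stdio.h>
#include <libpq-fe.h>

/* run one statement and report failures; returns 1 on success */
static int run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);

    if (!ok)
        fprintf(stderr, "%s: %s", sql, PQerrorMessage(conn));
    PQclear(res);
    return ok;
}

int main(void)
{
    PGconn *conn = PQconnectdb("host=master dbname=postgres"); /* made up */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        PQfinish(conn);
        return 1;
    }

    run(conn, "CREATE REPLICATION FILTER billing_only EMPTY INCLUDE CHANGES");
    run(conn, "ALTER REPLICATION FILTER billing_only ADD TABLE billing_calls");
    run(conn, "CREATE REPLICA billing_node"
              " CONNECT FROM SLAVE 'host=slave1 user=repl'"
              " TRANSFER COMMAND 'scp %p slave1:%f'"
              " USING billing_only");

    PQfinish(conn);
    return 0;
}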

Internally, CREATE REPLICA would initiate a base backup to the new slave
server, just like we would do manually otherwise. The server would
automatically use the user-defined 'command' to copy one file after the other
to the slave box. The idea is basically stolen from archive_command and
friends. At this stage we either copy the entire instance, as we would with a
normal base backup, or just what is needed (defined by the replication
filter). Users would automatically copy to a slave only the data which is
really needed there and which matches their filter config. Once the copy is
done, we can register the new node inside a system table and commit the
transaction. Also, we can automatically create a useful recovery.conf setup -
we know how to connect from the slave to the master (we can use 'CONNECT FROM
SLAVE [ USING ]' to write a proper recovery.conf file).

Tables can easily be added to or removed from a replication filter with ALTER
REPLICATION FILTER.

Replicas can be removed easily:

DROP REPLICA node_name;

Why SQL to add a node? We are convinced that this is the most simplistic and
the most intuitive way of doing things. We believe it gives users a real
feeling of simplicity. The current way of doing base backups should stay in
place as it is - it has proven to be nice for countless tasks. However, it is
not suitable for managing 10 or more replicas easily, especially not when
they are not full-blown copies of the master.


Technical ideas:


System tables:

We suggest always replicating the entire system catalog. It would be a total
disaster to try some other implementation. The same applies for other tables -
we always replicate entire tables; no WHERE clauses allowed when it comes to