Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-05-04 Thread Jim Nasby

On 5/3/12 2:54 PM, Josh Berkus wrote:

(2)  If logical transactions had been implemented as additions to
  the WAL stream, and Slony was using that, do you think they would
  still have been usable for this recovery?

Quite possibly not.


The key advantage that I see in londiste/slony replication is that your data
stream has absolutely nothing to do with anything binary or internal to
Postgres. That means that the only way corruption will travel from a master to
a slave is if the corruption is in the actual fields being updated, and even
that's not a given (i.e., UPDATING a field to a completely new value would not
propagate corruption even if the old value of the field was corrupted).
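
As a rough sketch of that point (hypothetical table and values, not Slony's
actual log schema): a trigger-based system captures row values as text, so
only the freshly supplied value crosses the wire, while physical WAL can ship
whole page images, corrupted bytes included.

-- Hypothetical illustration: logical replication ships values, not pages.
CREATE TABLE accounts (id int PRIMARY KEY, note text);
INSERT INTO accounts VALUES (1, 'possibly-corrupt old value');
-- The logical stream carries roughly
--   UPDATE accounts SET note = 'fresh value' WHERE id = 1;
-- so the corrupted old bytes never leave the master.
UPDATE accounts SET note = 'fresh value' WHERE id = 1;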

So, embedding a logical stream into WAL is not inherently bad... what
would be bad is if that logical stream were susceptible to corruption
due to something like full page writes. Simply embedding the exact
same info Slony or londiste captures into the WAL should be fine
(though that likely defeats the purpose). Translating binary WAL data
into DML statements would very likely allow corruption to travel from
master to slave.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-05-03 Thread Josh Berkus

 One thing I wanted to mention is that non-binary replication has an
 added advantage over binary from a DR standpoint: if corruption occurs
 on a master, binary replication is more likely to carry it into your
 replicas thanks to full page writes. You might want to consider that
 depending on how sensitive your data is.

Yeah, we've seen this a few times.  We just recently had to rescue a
client from HS-wide corruption using Slony.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-05-03 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 
 One thing I wanted to mention is that non-binary replication has
 an added advantage over binary from a DR standpoint: if
 corruption occurs on a master, binary replication is more likely
 to carry it into your replicas thanks to full page writes. You
 might want to consider that depending on how sensitive your data
 is.
 
 Yeah, we've seen this a few times.  We just recently had to rescue
 a client from HS-wide corruption using Slony.
 
That's an interesting point.  Out of curiosity, how did the
corruption originate?
 
It suggests a couple questions:
 
(1)  Was Slony running before the corruption occurred?  If not, how
was Slony helpful?  I know that in our environment, where we have
both going through separate streams, with a repository of the
logical transactions, we would use PITR recovery to get to the
latest known good state which we could easily identify, and then
replay the logical transactions to top it off to get current.  If
necessary we could skip logical transactions which were problematic
results of the corruption.
 
(2)  If logical transactions had been implemented as additions to
the WAL stream, and Slony was using that, do you think they would
still have been usable for this recovery?
 
Perhaps sending both physical and logical transaction streams over
the WAN isn't such a bad thing, if it gives us more independent
recovery mechanisms.  That's fewer copies than we're sending with
current trigger-based techniques.  It would be particularly
attractive if we could omit (filter out) certain tables before going
across the WAN.  I would be willing to risk sending the big
raster-scanned documents through just the physical channel so long
as I had a nightly compare of md5sum values on both sides so we
could resend any corrupted data (or tell people to rescan).
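
A sketch of such a nightly compare (table and column names here are
hypothetical, not our actual tooling): run the same checksum query on
both sides, ordered identically, and diff the output.

-- Run on both master and replica, then diff the results.
-- "scanned_documents" and "document_image" are invented names; the
-- ORDER BY matters so both sides emit rows in the same sequence.
SELECT id, md5(document_image) AS doc_md5
  FROM scanned_documents
 ORDER BY id;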
 
-Kevin



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-05-03 Thread Josh Berkus

 That's an interesting point.  Out of curiosity, how did the
 corruption originate?

We're still not sure.  It appears to be in the system catalogs, though.
 Note that the original master developed memory issues.

 It suggests a couple questions:
  
 (1)  Was Slony running before the corruption occurred?  

No.

 If not, how
 was Slony helpful?  

Install, replicate DB logically, new DB works fine.

 (2)  If logical transactions had been implemented as additions to
 the WAL stream, and Slony was using that, do you think they would
 still have been usable for this recovery?

Quite possibly not.

 Perhaps sending both physical and logical transaction streams over
 the WAN isn't such a bad thing, if it gives us more independent
 recovery mechanisms.  That's fewer copies than we're sending with
 current trigger-based techniques. 

Frankly, there's nothing wrong with the Slony model for replication
except for the overhead of:
1. triggers
2. queues
3. running DDL

However, the three above are really big issues.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-05-02 Thread Jim Nasby

On 4/29/12 9:27 AM, Kevin Grittner wrote:

Maybe I can help with that by describing what the Wisconsin court
system does for circuit court data.


Thanks for the write-up, it was insightful.

One thing I wanted to mention is that non-binary replication has an added
advantage over binary from a DR standpoint: if corruption occurs on a master,
binary replication is more likely to carry it into your replicas thanks to
full page writes. You might want to consider that depending on how sensitive
your data is.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Kevin Grittner
Tom Lane  wrote:
 Kevin Grittner  writes:
 
 Well, the xReader approach (Aakash's GSoC project) is to serve as
 a proxy for a WAL stream going to a hot standby, to interpret each
 incoming WAL record by cheating and querying the HS before
 passing the WAL along, and then using sync rep to the HS
 regardless of whether the incoming stream is sync or async. What
 reliability problem do you see with this approach?

 Well, first off, it requires a replica that *is* a physical replica
 (else it won't have the same table OIDs, for instance). I'm not
 sure I get the point of logical replication that requires a
 physical replica as a prerequisite.
 
Maybe I can help with that by describing what the Wisconsin court
system does for circuit court data.  Due to state law, state supreme
court rules, and the general suckiness of the WAN available to us,
each of Wisconsin's 72 counties has a database server in the county
seat which holds the official data of record.  We have two parallel
forms of backup -- WAL-based and logical.
 
For WAL-based replication we maintain a copy of the last two weekly
base backups and all WAL files needed to recover from either of them
to any point in time.  One copy of this is required to be in the
county, on a separate system with separate drives.  The other copy is
maintained centrally.  One rather large server has a hot standby of
all county databases, currently fed through copying WAL files
(although we plan to move to streaming this year).  These physical
backups are the primary source for recovering from catastrophic
failure of a county server.
 
We have software to top off the recovered database with
transactions from the logical transaction stream as needed, since
that is more up-to-date.  We have also found this combination of
physical and logical transactions useful when someone trying to do a
data fix with direct SQL statements against a production database
mangled things badly.  We use PITR recovery up to just before the
damage, then apply logical transactions from that point forward,
skipping the problem transactions.
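
As a sketch of the PITR half of that (recovery.conf in the style of
the PostgreSQL versions of that era; the timestamp and archive path
are hypothetical):

# recovery.conf: recover to just before the damage, then stop.
restore_command = 'cp /wal_archive/%f %p'
recovery_target_time = '2012-04-20 16:59:00'
recovery_target_inclusive = false

From there the logical transactions are replayed forward, minus the
damaging ones.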
 
Until recently our logical transaction stream was generated from our
Java middle tier running on the database servers.  This has recently
been converted to use a generalized PostgreSQL trigger, written in C,
which is attached to all tables for which we want logical replication
(about 90% of them).  The trigger captures data into transaction
tables in the database as part of the same transaction they
represent, so that the generation of the logical transaction data
is atomic with the represented data.  We didn't switch to one of the
existing solutions because of all the code downstream which uses the
logical transaction streams -- we didn't want to change the format or
delivery method of the logical stream because it's simply not feasible
to change everything at once.
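
As an illustrative analogue (the real trigger is written in C; the
table and column names here are invented), the shape of such a capture
trigger in PL/pgSQL -- the key property being that the log insert is
part of the same transaction as the change itself:

-- Hypothetical capture table and trigger; atomic with the real change.
CREATE TABLE transaction_log (
    txid       bigint      NOT NULL,
    logged_at  timestamptz NOT NULL DEFAULT now(),
    table_name text        NOT NULL,
    op         text        NOT NULL,  -- 'INSERT', 'UPDATE', or 'DELETE'
    row_data   text
);

CREATE TABLE some_replicated_table (id int PRIMARY KEY, payload text);

CREATE OR REPLACE FUNCTION capture_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO transaction_log (txid, table_name, op, row_data)
    VALUES (txid_current(), TG_TABLE_NAME, TG_OP,
            CASE WHEN TG_OP = 'DELETE' THEN OLD::text ELSE NEW::text END);
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER capture_changes
    AFTER INSERT OR UPDATE OR DELETE ON some_replicated_table
    FOR EACH ROW EXECUTE PROCEDURE capture_change();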
 
The way the logical stream is currently delivered is still through
the aforementioned Java middle tier.  It reads the logical
transaction log tables and sends the same data in non-persistent JMS
messages through queues to the TRAN client at the central location,
just as we did for years before we ever heard of PostgreSQL.
Normally, the TRAN client reads each logical transaction once, and
feeds many replication targets at the central site.  These include:
 
  -  four Central Information Repository (CIR) databases each of
 which contains data from all 72 counties  (two machines each at
 two different sites)

  -  four Database Transaction Repository (DTR) databases on the
     same servers as the CIR databases (but separate RAIDs).  This is
     used as a supplementary source for recovery (as mentioned
     above), as well as for auditing what happened on a case when
     there is any question, and for certain activity reports.

  -  a filter to identify transactions of possible interest to the
     publishers we use for interfaces to other agencies (district
     attorneys, public defenders, department of corrections, police
     agencies, social service agencies, etc.).  Transactions of
     interest are queued for closer review by the publisher software,
     which posts messages only if the transaction is found to
     actually be of interest.

  -  other databases for the state court's central services for
 such things as court interpreters used for non-English speakers
 in the various counties
 
The number and varied nature of the targets is significant, because
the fan out and timing differences need to be considered.  The TRAN
client normally receives one copy of the logical data for each
transaction, and about 99% of the time the transaction's data is
replicated to all databases (and queued for review by publishers)
within a second of someone pressing Enter 300 miles away.  It's that
other 1% that gets really interesting.
 
To deal with the possibility that one or more central targets were
down or running very slowly, the TRAN client will give up on keeping
them at the 

Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Simon Riggs
On Sun, Apr 29, 2012 at 3:27 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:

 It would be enormously more performant for the master to be
 emitting logical replication records to start with, since it
 already has all the right names etc at hand at basically no cost.

 Not when the consumers are across a WAN, and that WAN is the biggest
 performance bottleneck and the most expensive resource involved.

I agree that the WAN is important, for both bandwidth and response time.

Though it isn't a given that logical change records (LCRs) will
require more bandwidth than physical WAL. WAL contains full page
images, index changes, and other information that would be absent
from the LCR stream. It also depends upon the specification of the
LCRs - what metadata is included and whether the LCRs use text or
binary. Those choices have other impacts as well, so measurements and
detailed analysis are required to justify how to proceed, which is
what is in progress now.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Kevin Grittner
Simon Riggs  wrote:
 
 Though it isn't a given that logical change records (LCRs) will
 require more bandwidth than physical WAL.
 
It would help if you could share a little more of what you have in
mind.  It's not yet clear to me whether you're talking about adding
logical records to the WAL stream, replacing page-oriented records in
the WAL stream with logical records, or leaving the WAL stream alone
and generating a second, new stream of logical data.
 
 WAL contains full page images, index changes and other information
 that would be absent from the LCR stream.
 
Other than full page images, which could be eliminated in other ways
before streaming the WAL, what is in the WAL file that would not be
needed to maintain an exact replica of the cluster, suitable for
quick deployment in place of a lost cluster for disaster recovery? Or
do you foresee eliminating some of the current page-image oriented
WAL records in favor of logical WAL records, with an expectation that
the logical records will be smaller?
 
 measurements and detailed analysis are required to justify how to
 proceed, which is what is in progress now.
 
Could you clarify?  Are you referring to this discussion or some
effort at 2Q that hasn't yet been shared with the community?
 
-Kevin



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Simon Riggs
On Sun, Apr 29, 2012 at 7:36 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Simon Riggs  wrote:

 Though it isn't a given that logical change records (LCRs) will
 require more bandwidth than physical WAL.

 It would help if you could share a little more of what you have in
 mind.  It's not yet clear to me whether you're talking about adding
 logical records to the WAL stream, replacing page-oriented records in
 the WAL stream with logical records, or leaving the WAL stream alone
 and generating a second, new stream of logical data.

 WAL contains full page images, index changes and other information
 that would be absent from the LCR stream.

 Other than full page images, which could be eliminated in other ways
 before streaming the WAL, what is in the WAL file that would not be
 needed to maintain an exact replica of the cluster, suitable for
 quick deployment in place of a lost cluster for disaster recovery? Or
 do you foresee eliminating some of the current page-image oriented
 WAL records in favor of logical WAL records, with an expectation that
 the logical records will be smaller?

 measurements and detailed analysis are required to justify how to
 proceed, which is what is in progress now.

 Could you clarify?  Are you referring to this discussion or some
 effort at 2Q that hasn't yet been shared with the community?

These are early days yet and I've been deliberately vague on the
format and content of the LCRs. That is to let the idea that we have
choices bed in a little, so we can discuss them. The one thing that
does stand clear in my mind at this stage is that the physical
transport of LCRs should be the current streaming infrastructure. So
the different aspects of the design are at differing stages of
certainty (in my mind).

Much of today has been spent on a presentation explaining the thought
processes and options available to us. Deciding between those options
needs to be fact based rather than just a matter of opinion and I want
to ensure we make the right choices, whatever they are.

I've mentioned a few times already that we're working on prototypes,
all of which will be shared with the community. R&D, in that order.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Robert Haas
On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Translating WAL is a very hard task.

 No kidding.  I would think it's impossible on its face.  Just for
 starters, where will you get table and column names from?  (Looking at
 the system catalogs is cheating, and will not work reliably anyway.)

 IMO, if we want non-physical replication, we're going to need to build
 it in at a higher level than after-the-fact processing of WAL.
 I foresee wasting quite a lot of effort on the currently proposed
 approaches before we admit that they're unworkable.

I think the question we should be asking ourselves is not whether WAL
as it currently exists is adequate for logical replication, but rather
or not it could be made adequate.  For example, suppose that we were
to arrange things so that the first insert, update, or delete record
for a given relfilenode after each checkpoint emits a special WAL
record that contains the relation name, schema OID, attribute names,
and attribute type OIDs.  Well, now we are much closer to being able
to do some meaningful decoding of the tuple data, and it really
doesn't cost us that much.  Handling DDL (and manual system catalog
modifications) seems pretty tricky, but I'd be very reluctant to give
up on it without banging my head against the wall pretty hard.  The
trouble with giving up on WAL completely and moving to a separate
replication log is that it means a whole lot of additional I/O, which
is bound to have a negative effect on performance.
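
To make that concrete, the payload of such a hypothetical metadata
record is roughly what this catalog query gathers today (a sketch of
the information involved, not a proposed record format):

-- What a decoder needs per relfilenode and currently cannot get
-- without catalog access: relation name, schema, attribute names,
-- and attribute type OIDs.  The relfilenode value is hypothetical.
SELECT c.relname,
       c.relnamespace AS schema_oid,
       a.attname,
       a.atttypid     AS type_oid
  FROM pg_class c
  JOIN pg_attribute a ON a.attrelid = c.oid
 WHERE c.relfilenode = 16384
   AND a.attnum > 0
   AND NOT a.attisdropped;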

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Hannu Krosing
On Sun, 2012-04-29 at 16:33 -0400, Robert Haas wrote:
 On Sat, Apr 28, 2012 at 11:06 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  Translating WAL is a very hard task.
 
  No kidding.  I would think it's impossible on its face.  Just for
  starters, where will you get table and column names from?  (Looking at
  the system catalogs is cheating, and will not work reliably anyway.)
 
  IMO, if we want non-physical replication, we're going to need to build
  it in at a higher level than after-the-fact processing of WAL.
  I foresee wasting quite a lot of effort on the currently proposed
  approaches before we admit that they're unworkable.
 
 I think the question we should be asking ourselves is not whether WAL
 as it currently exists is adequate for logical replication, but rather
 or not it could be made adequate.  

Agreed. 

 For example, suppose that we were
 to arrange things so that the first insert, update, or delete record
 for a given relfilenode after each checkpoint emits a special WAL
 record that contains the relation name, schema OID, attribute names,
 and attribute type OIDs.

Not just the first after a checkpoint, but also the first after a
schema change.  Even though this will duplicate the WAL carrying the
system catalog changes, it is likely much cheaper overall to always
have fresh structure information in the WAL stream.

And if we really want to do WAL -> logical -> SQL-text conversion on
a host separate from the master, we also need to ship the type
definitions of user-defined types, together with at least the types'
output functions, in some form.

So you basically need a large part of Postgres to reliably make sense
of WAL.

 Well, now we are much
 closer to being able to do some meaningful decoding of the tuple data,
 and it really doesn't cost us that much.  Handling DDL (and manual
 system catalog modifications) seems pretty tricky, but I'd be very
 reluctant to give up on it without banging my head against the wall
 pretty hard. 

The most straightforward way is to have a more or less full copy of
pg_catalog also on the WAL-filtering / WAL-conversion node, and to
keep it current by applying 1:1 replicas of the transactions
recreated from the WAL.
This way we can avoid recreating any alternate view of the master's
schema.

Then again, we could do it all on the master, inside the WAL-writing
transaction, and thus avoid a large chunk of the problems.

If the receiving side is also PostgreSQL with the same catalog
structure (i.e., the same major version) then we don't actually need
to handle DDL in any complicated way; it would be enough to just
carry over the changes to the system tables.

The main reason we don't do it currently for trigger-based logical
replication is the restriction of not being able to have triggers on
system tables.

I hope it is much easier to have the triggerless record generation also
work on system tables.

 The trouble with giving up on WAL completely and moving
 to a separate replication log is that it means a whole lot of
 additional I/O, which is bound to have a negative effect on
 performance.

Why would you give up WAL?

Or do you mean that the new logical WAL needs to have the same
commit-time behaviour as WAL to be reliable?

I'd envision a scenario where the logical WAL is sent to the slave or
a distribution hub directly and not written at the local host at all.
An optional sync mode similar to current sync WAL replication could
be configured. I hope this would run mostly in parallel with local
WAL generation, so not much extra wall-clock time would be wasted.
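
For reference, the existing sync-rep knobs such an optional mode could
mirror look like this (stock PostgreSQL settings with a hypothetical
standby name; this is not the proposed logical-WAL mode):

# postgresql.conf on the master: stock synchronous replication,
# the model an optional sync mode for logical WAL could follow.
synchronous_standby_names = 'logical_hub'
synchronous_commit = on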

 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 

-- 
---
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/




Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Robert Haas
On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 I think the question we should be asking ourselves is not whether WAL
 as it currently exists is adequate for logical replication, but rather
 or not it could be made adequate.

 Agreed.

And of course I meant "but rather whether or not it could be made
adequate", but I dropped a word.

  For example, suppose that we were
  to arrange things so that the first insert, update, or delete record
  for a given relfilenode after each checkpoint emits a special WAL
  record that contains the relation name, schema OID, attribute names,
  and attribute type OIDs.

 Not just the first after a checkpoint, but also the first after a
 schema change.  Even though this will duplicate the WAL carrying the
 system catalog changes, it is likely much cheaper overall to always
 have fresh structure information in the WAL stream.

Yes.

 And if we really want to do WAL -> logical -> SQL-text conversion on
 a host separate from the master, we also need to ship the type
 definitions of user-defined types, together with at least the types'
 output functions, in some form.

Yes.

 So you basically need a large part of Postgres to reliably make sense
 of WAL.

Agreed, but I think that's a problem we need to fix, not a tolerable
situation at all.  If a user can create a type-output function that
goes and looks at the state of the database to determine what to
output, then we are completely screwed, because that basically means
you would need to have a whole Hot Standby instance up and running
just to make it possible to run type output functions.  Now you might
be able to build a mechanism around that that is useful to some
people in some situations, but wow does that sound painful.  What I
want is for the master to be able to cheaply rattle off the tuples
that got inserted, updated, or deleted as those things happen;
needing a whole second copy of the database just to do that does not
meet my definition of cheap.  Furthermore, it's not really clear that
it's sufficient anyway, since there are problems with what happens
before the HS instance reaches consistency, with what happens when it
crashes and restarts, and with how we handle the case where the
system catalog we need to examine to generate the logical replication
records is access-exclusive-locked.  Seems like a house of cards.
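
Enum types are a concrete instance of the catalog dependency (a hedged
illustration; the type and labels here are hypothetical): the value
WAL carries is just an OID into pg_enum, so enum_out cannot render it
without catalog access.

-- What is stored on disk (and in WAL) for a column of type mood is
-- the pg_enum row's OID, not the label text:
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
SELECT oid, enumlabel
  FROM pg_enum
 WHERE enumtypid = 'mood'::regtype;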

Some of this might be possible to mitigate contractually, by putting
limits on what type input/output functions are allowed to do.  Or we
could invent a new analog of type input/output functions that is
explicitly limited in this way, and support only types that provide
it.  But I think the real key is that we can't rely on catalog access:
the WAL stream has to have enough information to allow the reader to
construct some set of in-memory hash tables with sufficient detail to
reliably decode WAL.  Or at least that's what I'm thinking.

 The most straightforward way is to have a more or less full copy of
 pg_catalog also on the WAL-filtering / WAL-conversion node, and to
 keep it current by applying 1:1 replicas of the transactions
 recreated from the WAL.
 This way we can avoid recreating any alternate view of the master's
 schema.

See above; I have serious doubts that this can ever be made to work robustly.

 Then again, we could do it all on the master, inside the WAL-writing
 transaction, and thus avoid a large chunk of the problems.

 If the receiving side is also PostgreSQL with the same catalog
 structure (i.e., the same major version) then we don't actually need
 to handle DDL in any complicated way; it would be enough to just
 carry over the changes to the system tables.

I agree it'd be preferable to handle DDL in terms of system catalog
updates, rather than saying, well, this is an ALTER TABLE .. RENAME.
But you need to be able to decode tuples using the right tuple
descriptor, even while that's changing under you.

 Why would you give up WAL?

For lack of ability to make it work.  Don't underestimate how hard
it's going to be to nail this down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 So you basically need a large part of Postgres to reliably make sense
 of WAL.

 Agreed, but I think that's a problem we need to fix and not a
 tolerable situation at all.  If a user can create a type-output
 function that goes and looks at the state of the database to determine
 what to output, then we are completely screwed, because that basically
 means you would need to have a whole Hot Standby instance up and
 running just to make it possible to run type output functions.

You mean like enum_out?  Or for that matter array_out, record_out,
range_out?

regards, tom lane



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-29 Thread Robert Haas
On Sun, Apr 29, 2012 at 11:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sun, Apr 29, 2012 at 6:00 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 So you basically need a large part of Postgres to reliably make sense
 of WAL.

 Agreed, but I think that's a problem we need to fix and not a
 tolerable situation at all.  If a user can create a type-output
 function that goes and looks at the state of the database to determine
 what to output, then we are completely screwed, because that basically
 means you would need to have a whole Hot Standby instance up and
 running just to make it possible to run type output functions.

 You mean like enum_out?  Or for that matter array_out, record_out,
 range_out?

Yeah, exactly.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Simon Riggs
On Fri, Apr 27, 2012 at 11:18 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Andres Freund and...@2ndquadrant.com wrote:

 I'm not convinced that I would rather see the logic fixed inside the
 master as opposed to being deployable on the master's machine, the
 slave machine, or even on its own machine in between.

There are use cases where the translation from WAL to logical takes
place on the master, the standby or other locations.

It's becoming clear that filtering records on the source is important
in high bandwidth systems, so the initial work focuses on putting that
on the master, i.e. the source, which was not my first thought
either. If you use cascading, this would still allow you to have
master -> standby -> logical.

Translating WAL is a very hard task. Some time ago, I did also think
an external tool would help (my initial design was called xfilter),
but I no longer think that is likely to work well except in very
simple cases.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Aakash Goel
 Aakash, when you get a chance, could you fill in the inch-stones
 from the GSoC proposal page onto the Wiki page?

Sure, http://wiki.postgresql.org/wiki/XReader updated.

On Sat, Apr 28, 2012 at 3:48 AM, Kevin Grittner kevin.gritt...@wicourts.gov
 wrote:

 Andres Freund and...@2ndquadrant.com wrote:

  In the current, prototypal, state there is one component that's
  integrated into the server (because it needs information that's
  only available there).

 The xReader design was based on the idea that it would be nice not
 to cause load on the master machine, and that by proxying the WAL
 stream to the HS, using synchronous replication style to write from
 xReader to the HS, you could use the HS for a source for that data
 with it being at exactly the right point in time to query it.

 I'm not convinced that I would rather see the logic fixed inside the
 master as opposed to being deployable on the master's machine, the
 slave machine, or even on its own machine in between.

  That component is layered on top of a totally generic xlog
  reading/parsing library that doesn't care at all where it's
  running.

 That's cool.

  It's also used in another cluster to read the received (filtered)
  stream.

 I don't quite follow what you're saying there.

  I plan to submit the XLogReader (that's what it's called atm)
  before everything else, so everybody can take a look as soon as
  possible.

 Great!  That will allow more discussion and planning.

  I took a *very* short glance over the current wiki description of
  xReader and from that it seems to me it would benefit from trying
  to make it architecturally more similar to the rest of pg.

 We're planning on using existing protocol to talk between pieces.
 Other than breaking it out so that it can run somewhere other than
 inside the server, and allowing clients to connect to xReader to
 listen to WAL events of interest, are you referring to anything
 else?

  I also would suggest reviewing how the current walreceiver/sender,
  and their protocol, work.

 Of course!  The first inch-stone in the GSoC project plan
 basically consists of creating an executable that functions as a
 walreceiver and a walsender to just pass things through from the
 master to the slave.  We build from there by allowing clients to
 connect (again, over existing protocol) and register for events of
 interest, and then recognizing different WAL records to generate
 events.  The project was just going to create a simple client to
 dump the information to disk, but with the time saved by adopting
 what you've already done, that might leave more time for generating
 a useful client.

 Aakash, when you get a chance, could you fill in the inch-stones
 from the GSoC proposal page onto the Wiki page?  I think the
 descriptions of those interim steps would help people understand
 your proposal better.  Obviously, some of the particulars of tasks
 and the dates may need adjustment based on the new work which is
 expected to appear before you start, but what's there now would be a
 helpful reference.

 -Kevin



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Aakash Goel
Hello Andres,

 The xReader design was based on the idea that it would be nice not
 to cause load on the master machine, and that by proxying the WAL
 stream to the HS, using synchronous replication style to write from
 xReader to the HS, you could use the HS for a source for that data
 with it being at exactly the right point in time to query it.
Yes, that does make sense for some workloads. I don't think it's viable for
everything though; that's why we're not aiming for that ourselves atm.

Regarding the above, what would be a case where querying the HS will not
suffice?

On Sat, Apr 28, 2012 at 4:02 AM, Andres Freund and...@2ndquadrant.com wrote:

 Hi Kevin, Hi Aakash,

 On Saturday, April 28, 2012 12:18:38 AM Kevin Grittner wrote:
  Andres Freund and...@2ndquadrant.com wrote:
   In the current, prototypal, state there is one component that's
   integrated into the server (because it needs information that's
   only available there).
  The xReader design was based on the idea that it would be nice not
  to cause load on the master machine, and that by proxying the WAL
  stream to the HS, using synchronous replication style to write from
  xReader to the HS, you could use the HS for a source for that data
  with it being at exactly the right point in time to query it.
 Yes, that does make sense for some workloads. I don't think it's
 viable for everything though; that's why we're not aiming for that
 ourselves atm.

  I'm not convinced that I would rather see the logic fixed inside the
  master as opposed to being deployable on the master's machine, the
  slave machine, or even on its own machine in between.
 I don't think that you can do everything apart from the master. We
 currently need shared memory for coordination between the moving
 parts; that's why we have it inside the master.
 It also has the advantage of being easier to set up.

   That component is layered on top of a totally generic xlog
   reading/parsing library that doesn't care at all where it's
   running.
  That's cool.

   It's also used in another cluster to read the received (filtered)
   stream.
  I don't quite follow what you're saying there.
 To interpret the xlog back into something that can be used for
 replication you need to read it again. After filtering we again write
 valid WAL, so we can use the same library on the sending|filtering
 side and on the receiving side.
 But that's actually off topic for this thread ;)


   I took a *very* short glance over the current wiki description of
   xReader and from that it seems to me it would benefit from trying
   to make it architecturally more similar to the rest of pg.
  We're planning on using existing protocol to talk between pieces.
  Other than breaking it out so that it can run somewhere other than
  inside the server, and allowing clients to connect to xReader to
  listen to WAL events of interest, are you referring to anything
  else?
 It sounds like the xReader is designed to be one multiplexing
 process. While this definitely has some advantages
 resource-usage-wise, it doesn't seem to fit the rest of the design
 that well. The advantages might outweigh everything else, but I am
 not sure about that.
 Something like registering/deregistering also doesn't fit that well
 with the way walsender works, as far as I understand it.

 Greetings,

 Andres
 --
  Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Translating WAL is a very hard task.

No kidding.  I would think it's impossible on its face.  Just for
starters, where will you get table and column names from?  (Looking at
the system catalogs is cheating, and will not work reliably anyway.)

IMO, if we want non-physical replication, we're going to need to build
it in at a higher level than after-the-fact processing of WAL.
I foresee wasting quite a lot of effort on the currently proposed
approaches before we admit that they're unworkable.

regards, tom lane



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Kevin Grittner
Tom Lane  wrote:
 Simon Riggs  writes:
 Translating WAL is a very hard task.
 
 No kidding. I would think it's impossible on its face. Just for
 starters, where will you get table and column names from? (Looking
 at the system catalogs is cheating, and will not work reliably
 anyway.)
 
Well, the xReader approach (Aakash's GSoC project) is to serve as a
proxy for a WAL stream going to a hot standby, to interpret each
incoming WAL record by cheating and querying the HS before passing
the WAL along, and then using sync rep to the HS regardless of
whether the incoming stream is sync or async.  What reliability
problem do you see with this approach?
 
-Kevin



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane  wrote:
 Simon Riggs  writes:
 Translating WAL is a very hard task.
 No kidding. I would think it's impossible on its face.
 
 Well, the xReader approach (Aakash's GSoC project) is to serve as a
 proxy for a WAL stream going to a hot standby, to interpret each
 incoming WAL record by cheating and querying the HS before passing
 the WAL along, and then using sync rep to the HS regardless of
 whether the incoming stream is sync or async.  What reliability
 problem do you see with this approach?

Well, first off, it requires a replica that *is* a physical replica
(else it won't have the same table OIDs, for instance).  I'm not sure I
get the point of logical replication that requires a physical replica as
a prerequisite.

Next, it breaks immediately in the face of DDL:

CREATE TABLE foo (f1 int);
BEGIN;
ALTER TABLE foo RENAME COLUMN f1 TO f2;
INSERT INTO foo (f2) VALUES (1);
COMMIT;

The standby is not going to think that the ALTER is committed, so it
will not report the right column name when it comes time to translate
the INSERT.  Actually, you can break it even more easily than that:

CREATE TABLE bar AS SELECT ...

What will you do with the insertions executed by this CREATE?  They have
to be executed before the creation of table bar is committed.

Also, you'd need a query connection per database (at least, maybe one
per session if you were trying to track DDL effects), which seems rather
a lot of load on the HS slave.  That together with the requirement for
synchronous operation seems absolutely catastrophic from a performance
standpoint.

If you want to spend the summer building a toy, fine, but I don't see
this going anywhere for production purposes.  It would be enormously
more performant for the master to be emitting logical replication
records to start with, since it already has all the right names etc
at hand at basically no cost.

regards, tom lane



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Aakash Goel
Hello Tom,

 I'm not sure I
 get the point of logical replication that requires a physical replica as
 a prerequisite.

 It would be enormously
 more performant for the master to be emitting logical replication
 records to start with, since it already has all the right names etc
 at hand at basically no cost.

I want to emphasize that any system which uses the logical reader
setup, including a replication system, tends to be very
computationally intensive on the database it queries so often. In
most environments, the source database is the main database, and as
such, any performance degradation on this database is very bad.

On the other hand, if we offload almost all the work to the physical
replica, our source database, which is the main database, still
functions at the same throughput.

Thus, at the cost of having the replication system as a whole run a
little slower than it could if it were using the main database, we
have made sure that the performance of our critical main source
database is not affected at all.


Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-28 Thread Simon Riggs
On Sat, Apr 28, 2012 at 4:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane  wrote:
 Simon Riggs  writes:
 Translating WAL is a very hard task.
 No kidding. I would think it's impossible on its face.

 Well, the xReader approach (Aakash's GSoC project) is to serve as a
 proxy for a WAL stream going to a hot standby, to interpret each
 incoming WAL record by cheating and querying the HS before passing
 the WAL along, and then using sync rep to the HS regardless of
 whether the incoming stream is sync or async.  What reliability
 problem do you see with this approach?

 Well, first off, it requires a replica that *is* a physical replica
 (else it won't have the same table OIDs, for instance).

It's even harder than that, because the physical replica needs to be
able to be shut down and restarted. We currently restart WAL replay
from the last checkpoint, but if changes have already been made and
committed then the catalog will be incorrect when we replay the WAL
again. So I think saying "very hard" is justified.

We basically have a choice of shipping full metadata with every change
or finding some way to avoid doing that. Investigating the latter
seems like a worthwhile task, but there are limits, as you say.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



[HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-27 Thread Kevin Grittner
[replaced bad email address for Josh (which was my fault)] 
 
Aakash Goel aakash.b...@gmail.com wrote: 
 All, the wiki page is now up at
  http://wiki.postgresql.org/wiki/XReader.
 
Note that the approach Aakash is taking doesn't involve changes to
the backend code; it is strictly a standalone executable which
functions as a proxy to a hot standby and to which clients like
replication systems connect.  There is a possible additional
configuration which wouldn't require a hot standby, if time permits.
I am not clear on whether 2nd Quadrant's code takes this approach
or builds it into the server.  I think we need to know that much
before we can get very far in discussion.
 
-Kevin



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-27 Thread Andres Freund
On Friday, April 27, 2012 11:04:04 PM Kevin Grittner wrote:
 [replaced bad email address for Josh (which was my fault)]
 
 Aakash Goel aakash.b...@gmail.com wrote:
  All, the wiki page is now up at
  
   http://wiki.postgresql.org/wiki/XReader.
 
 Note that the approach Aakash is taking doesn't involve changes to
 the backend code; it is strictly a standalone executable which
 functions as a proxy to a hot standby and to which clients like
 replication systems connect.  There is a possible additional
 configuration which wouldn't require a hot standby, if time permits.
 I am not clear on whether 2nd Quadrant's code takes this approach
 or builds it into the server.  I think we need to know that much
 before we can get very far in discussion.
In the current, prototypal, state there is one component that's
integrated into the server (because it needs information that's only
available there). That component is layered on top of a totally
generic xlog reading/parsing library that doesn't care at all where
it's running. It's also used in another cluster to read the received
(filtered) stream.
I plan to submit the XLogReader (that's what it's called atm) before
everything else, so everybody can take a look as soon as possible.

I took a *very* short glance over the current wiki description of
xReader, and from that it seems to me it would benefit from trying to
make it architecturally more similar to the rest of pg. I also would
suggest reviewing how the current walreceiver/sender, and their
protocol, work.

Andres


-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-27 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:
 
 In the current, prototypal, state there is one component that's
 integrated into the server (because it needs information that's
 only available there).
 
The xReader design was based on the idea that it would be nice not
to cause load on the master machine, and that by proxying the WAL
stream to the HS, using synchronous replication style to write from
xReader to the HS, you could use the HS for a source for that data
with it being at exactly the right point in time to query it.
 
I'm not convinced that I would rather see the logic fixed inside the
master as opposed to being deployable on the master's machine, the
slave machine, or even on its own machine in between.
 
 That component is layered on top of a totally generic xlog
 reading/parsing library that doesn't care at all where it's
 running.
 
That's cool.
 
 It's also used in another cluster to read the received (filtered)
 stream.
 
I don't quite follow what you're saying there.
 
 I plan to submit the XLogReader (that's what it's called atm)
 before everything else, so everybody can take a look as soon as
 possible.
 
Great!  That will allow more discussion and planning.
 
 I took a *very* short glance over the current wiki description of
 xReader and from that it seems to me it would benefit from trying
 to make it architecturally more similar to the rest of pg.
 
We're planning on using existing protocol to talk between pieces. 
Other than breaking it out so that it can run somewhere other than
inside the server, and allowing clients to connect to xReader to
listen to WAL events of interest, are you referring to anything
else?
 
 I also would suggest reviewing how the current walreceiver/sender,
 and their protocol, work.
 
Of course!  The first inch-stone in the GSoC project plan
basically consists of creating an executable that functions as a
walreceiver and a walsender to just pass things through from the
master to the slave.  We build from there by allowing clients to
connect (again, over existing protocol) and register for events of
interest, and then recognizing different WAL records to generate
events.  The project was just going to create a simple client to
dump the information to disk, but with the time saved by adopting
what you've already done, that might leave more time for generating
a useful client.
 
Aakash, when you get a chance, could you fill in the inch-stones
from the GSoC proposal page onto the Wiki page?  I think the
descriptions of those interim steps would help people understand
your proposal better.  Obviously, some of the particulars of tasks
and the dates may need adjustment based on the new work which is
expected to appear before you start, but what's there now would be a
helpful reference.
 
-Kevin



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-27 Thread Andres Freund
Hi Kevin, Hi Aakash,

On Saturday, April 28, 2012 12:18:38 AM Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
   In the current, prototypal, state there is one component that's
   integrated into the server (because it needs information that's
   only available there).
 The xReader design was based on the idea that it would be nice not
 to cause load on the master machine, and that by proxying the WAL
 stream to the HS, using synchronous replication style to write from
 xReader to the HS, you could use the HS for a source for that data
 with it being at exactly the right point in time to query it.
Yes, that does make sense for some workloads. I don't think it's viable for
everything though; that's why we're not aiming for that ourselves atm.

 I'm not convinced that I would rather see the logic fixed inside the
 master as opposed to being deployable on the master's machine, the
 slave machine, or even on its own machine in between.
I don't think that you can do everything apart from the master. We currently
need shared memory for coordination between the moving parts; that's why we
have it inside the master.
It also has the advantage of being easier to set up.

  That component is layered on top of a totally generic xlog
  reading/parsing library that doesn't care at all where it's
  running.
 That's cool.

  It's also used in another cluster to read the received (filtered)
  stream.
 I don't quite follow what you're saying there.
To interpret the xlog back into something that can be used for replication you
need to read it again. After filtering we again write valid WAL, so we can use
the same library on the sending|filtering side and on the receiving side.
But that's actually off topic for this thread ;)


  I took a *very* short glance over the current wiki description of
  xReader and from that it seems to me it would benefit from trying
  to make it architecturally more similar to the rest of pg.
 We're planning on using existing protocol to talk between pieces.
 Other than breaking it out so that it can run somewhere other than
 inside the server, and allowing clients to connect to xReader to
 listen to WAL events of interest, are you referring to anything
 else?
It sounds like the xReader is designed to be one multiplexing process. While
this definitely has some advantages resource-usage-wise, it doesn't seem to
fit the rest of the design that well. The advantages might outweigh
everything else, but I am not sure about that.
Something like registering/deregistering also doesn't fit that well with the
way walsender works, as far as I understand it.

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-04-27 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:
 
 Something like registering/deregistering also doesn't fit that
 well with the way walsender works as far as I understand it.
 
If you look at the diagrams on the xReader Wiki page, the lines
labeled "XLOG stream" are the ones using walsender/walreceiver.  The
green arrows represent normal connections to the database, to run
queries to retrieve metadata needed to interpret the WAL records.
The lines labeled "Listener n" are expected to use the pg protocol
to connect, but won't be talking page-oriented WAL -- they will be
dealing with logical interpretations of the WAL: the sort of data
which could be fed to a database which doesn't have the same page
images, as Slony et al do.
 
Perhaps, given other points you made, the library for interpreting
the WAL records could be shared, and hopefully a protocol for the
clients, although that seems a lot more muddy to me at this point. 
If we can share enough code, there may be room for both approaches
with minimal code duplication.
 
-Kevin
