Re: [HACKERS] Proposal: Commit timestamp

2007-02-21 Thread José Orlando Pereira
On Friday 09 February 2007, Jan Wieck wrote:
 I am not sure, I would have to look at what exactly that hook provides.
 The key to a Lamport timestamp is that it is advancing in commit order
 (plus some other things ... of course). If the hook can guarantee that
 the calls are made always in commit order, serialized without any race
 condition possible, it would probably be suitable.

Actually what we do is a bit stronger. We use the commit hook to enforce an 
externally defined commit order. In our case, this is defined by a group 
communication protocol, which is even allowed to reorder a pair of 
transactions originating from the same replica. Therefore, achieving a commit 
order that is consistent with a local clock should be straightforward.

Regards,

-- 
Jose Orlando Pereira

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/8/2007 11:41 PM, Richard Troy wrote:

On Thu, 8 Feb 2007, Joshua D. Drake wrote:


Well how deep are we talking here? My understanding of what Jan wants to
do is simple.

Be able to declare which triggers are fired depending on the state of
the cluster.

In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
or Slave.

This is useful because I may have a trigger on the Master and the same
trigger on the Slave. You do not want the trigger to fire on the Slave
because we are doing data replication. In short, we replicate the
result, not the action.

However, you may want triggers that are on the Slave to fire separately.
A reporting server that generates materialized views is a good example.
Don't tie up the Master with what a Slave can do.



It'd be great if Jan considers the blending of replication; 


Please elaborate. I would really like to get all you can contribute.




 I have no clue what got you into what you are doing here.


Jan, some sleep now and then might be helpful to your public disposition.


Richard, don't embarrass Bruce. He doesn't need your help.

I have been with this project and know Bruce Momjian for more than 10 
years. Every now and then, Bruce and I get into some sort of eventually 
publicly visible dispute that doesn't really mean much. I'll probably 
spend next Christmas with him and his family again, play a few rounds of 
backgammon with Wilma (whom I really owe a revenge), hopefully won't 
interfere too much with Christine's work (especially when it involves 
handling food over a white carpet) and none of us will even remember 
this crap. Our friendship has been through some real tests. Any real 
problem we would have, we'd never discuss here. We would just meet and 
talk.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/7/2007 7:13 AM, José Orlando Pereira wrote:

On Saturday 03 February 2007, Bruce Momjian wrote:

Jan Wieck wrote:
 I don't have any such paper and the proof of concept will be the
 implementation of the system. I do however see enough resistance against
 this proposal to withdraw the commit timestamp at this time. The new
 replication system will therefore require the installation of a patched,
 non-standard PostgreSQL version, compiled from sources cluster-wide in
 order to be used. I am aware that this will dramatically reduce its
 popularity, but it is impossible to develop this essential feature as an
 external module.

 I thank everyone for their attention.

Going and working on it on your own doesn't seem like the proper
solution.  I don't see people objecting to adding it, but they want it to
work, which I am sure you want too.  You have to show how it will work
and convince others of that, and then you have a higher chance it will
work and be in the PostgreSQL codebase.


Hi,

Would it be possible to solve the problem using the GORDA on-commit hook?

Jan would be able to reliably obtain a commit timestamp with the desired 
semantics and store it in a regular table within transaction boundaries.


I am not sure, I would have to look at what exactly that hook provides. 
The key to a Lamport timestamp is that it is advancing in commit order 
(plus some other things ... of course). If the hook can guarantee that 
the calls are made always in commit order, serialized without any race 
condition possible, it would probably be suitable.
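The serialized, race-free, commit-ordered hook Jan asks for can be illustrated with a minimal sketch (Python, illustrative only; the class and method names are hypothetical and belong to neither PostgreSQL nor GORDA). A mutex stands in for the serialized commit path, and each timestamp is forced past its predecessor so that timestamp order always equals commit order:

```python
import threading
import time

class CommitTimestampGenerator:
    """Lamport-style commit timestamps: wall-clock based, but bumped
    forward whenever the clock stalls or steps back, so the timestamp
    order always matches the serialized commit order."""

    def __init__(self):
        self._lock = threading.Lock()  # serializes the commit-time calls
        self._last = 0.0

    def next_commit_timestamp(self):
        # The lock plays the role of the race-free, commit-ordered hook:
        # only one committer can obtain a timestamp at a time.
        with self._lock:
            now = time.time()
            # Advance past the previous commit even if the local clock
            # lags or was adjusted backwards (e.g. by NTP).
            self._last = max(now, self._last + 1e-6)
            return self._last
```

The essential property is the one Jan names: monotonicity in commit order, regardless of what the underlying clock does between calls.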



Jan



PostgreSQL would not have to commit to specific timestamp semantics, and the 
patch is quite small.


Regards,






Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Kris Jurka

Richard Troy wrote:


It'd be great if Jan considers the blending of replication; any given DB
instance shouldn't be only a master/originator or only a slave/subscriber.
A solution that lets you blend replication strategies in a single db is,
from my point of view, very important.



Perhaps if more people read Jan's posts he wouldn't be so frustrated.

http://archives.postgresql.org/pgsql-hackers/2007-01/msg01302.php

He clearly describes that the master/slave setting is per session, not 
per database.


Kris Jurka



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread J. Andrew Rogers

On Feb 8, 2007, at 8:41 PM, Richard Troy wrote:
It'd be great if Jan considers the blending of replication; any given DB
instance shouldn't be only a master/originator or only a slave/subscriber.
A solution that lets you blend replication strategies in a single db is,
from my point of view, very important.



It might be constructive to define what a minimal complete set of
replication primitives actually is, in addition to which ones should
be implemented.  Beyond master/slave models, there are Paxos
algorithms and dynamic reconfiguration models in the literature that can
utilize many of the same primitives but which are very different in
implementation.  I see the value of Jan's proposal, but perhaps it
would be better to step back and make some assertions about the
nature of the core capabilities that will be supported in some
broader picture.  Having a theoretically (mostly) complete set of
usable primitives would be an incredibly powerful feature set.


Cheers,

J. Andrew Rogers
[EMAIL PROTECTED]




Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Richard Troy

On Fri, 9 Feb 2007, Jan Wieck wrote:
  [ I wrote ]
  It'd be great if Jan considers the blending of replication;

 Please elaborate. I would really like to get all you can contribute.

Thanks Jan,

prefacing that I really haven't read everything you've written on this (or
what other people are doing, either), and that I've got a terrible flu
right now (fever, etc.), I'll give it a go - hopefully it's actually
helpful. To wit:

In general terms, blending of replication [techniques] means to me that
one can have a single database instance serve as a master and as a slave
(to use only one set of terminology), and as a multi-master, too, all
simultaneously, letting the DBA / Architect choose which portions serve
which roles (purposes). All replication features would respect the
boundaries of such choices automatically, as it's all blended.

In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier, and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatibility, if a schema isn't specified, it indicates that the command
pertains to the entire database.

At the very least, a schema division strategy for replication leverages
an existing DB-component binding/dividing mechanism that most everyone is
familiar with. While there are/may be database-wide, nay, installation-
wide constructs as in your Commit Timestamp proposal, I don't see that
there's any conflict - at least, from what I understand of existing
systems and proposals to date.

HTH,
Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/




Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Hammond
On Feb 7, 8:12 pm, [EMAIL PROTECTED] (Bruce Momjian) wrote:
 Jan Wieck wrote:
  On 2/7/2007 10:35 PM, Bruce Momjian wrote:
   I find the term "logical proof of its correctness" too restrictive.  It
   sounds like some formal academic process that really doesn't work well
   for us.

  Thank you.

My intuition is that it might be possible to prove that _nothing_ can
provide guaranteed ordering when there is disconnected operation.
However, I think that the clock based ordering Jan has described could
provide _probable_ ordering under disconnected operation. I can see
three variables in the equation that would determine the probability
of correctness for the ordering.
1) clock drift rate between disconnected clusters
2) disconnection time
3) transaction rate on the tables, or even rows involved
There are probably more. I think that if Jan implements what he's
described then a very interesting follow-up would be to do the
statistical analysis necessary to quantify the risk of incorrect
ordering while disconnected. (I've got x ms/ms relative clock drift,
and y tps. How long can I run disconnected  before falling under
99.999% probability of correctly ordered transactions?)
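Leaving the full statistical analysis aside, the relationship between the variables above admits a back-of-envelope sketch (the helper function is hypothetical, invented for illustration; it assumes linear drift and evenly spaced updates):

```python
def safe_disconnect_window(drift_ms_per_s, update_interval_s):
    """Seconds two disconnected clusters can run before their accumulated
    clock skew exceeds the typical interval between conflicting updates,
    i.e. before commit-timestamp order can contradict real-time order.
    Back-of-envelope only: linear drift, evenly spaced updates."""
    skew_per_second = drift_ms_per_s / 1000.0  # seconds of skew per second
    return update_interval_s / skew_per_second
```

For example, with 0.5 ms/s relative drift and conflicting updates arriving roughly 3 s apart, the accumulated skew reaches 3 s after about 6000 s, i.e. the clusters could run disconnected for roughly 100 minutes before timestamp order becomes unreliable for such pairs of updates.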

 No, I _now_ understand the use case, but when the patch was posted, the
 use case was missing.  I would like to see a repost with the patch, and
 a description of its use so we can all move forward on that.

An additional use case for an on-commit timestamp is in the analysis
of billing transactions in highly concurrent systems. For example,
imagine your billing period is monthly and you have transactions which
start before and end after the end-of-month. Having the on-commit
timestamp for these transactions may help when attempting to reconcile
between transactions and account activities.

Andrew




Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Dunstan

Richard Troy wrote:

In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatability, if a schema isn't specified, it indicates that command
pertains to the entire database.

  


I understand that you're just thinking aloud, but overloading namespaces 
in this way strikes me as awful. Applications and extensions, which are 
the things that have need of namespaces, should not have to care about 
replication. If we have to design them for replication we'll be on a 
fast track to nowhere IMNSHO.



cheers

andrew



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/9/2007 2:27 PM, Richard Troy wrote:

In general terms, blending of replication [techniques] means to me that
one can have a single database instance serve as a master and as a slave
(to use only one set of terminology), and as a multi-master, too, all
simultaneously, letting the DBA / Architect choose which portions serve
which roles (purposes). All replication features would respect the
boundaries of such choices automatically, as it's all blended.


That is specifically what the changes to pg_trigger and pg_rewrite take 
into account. However much you blend different techniques, a single 
transaction on one server will always fall into one of three categories. 
1) It could be the original operation done by the client application. 2) 
It could be the actions performed by the replication engine to replay a 
remote transaction. And 3) it can be an administrative operation that 
must not be propagated at all.


No matter how many different models you have in parallel, one single 
transaction will be either a master, a slave, or an isolated local thing. 
The proposed changes make it possible to tell the session which of these 
three roles it is playing, and the triggers and rules can be configured 
to fire during the master/local role, the slave role, always, or never. 
That functionality will work for master-slave as well as multi-master.
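The role/configuration matrix Jan describes can be sketched like this (illustrative Python; the enum and function names are invented for the sketch and do not reflect the actual pg_trigger columns):

```python
from enum import Enum

class SessionRole(Enum):
    ORIGIN = "origin"    # original client transaction (master)
    REPLICA = "replica"  # replay of a remote transaction (slave)
    LOCAL = "local"      # isolated administrative work, never propagated

class FireWhen(Enum):
    ORIGIN = "origin"    # fire during the master/local role
    REPLICA = "replica"  # fire only while replaying replicated data
    ALWAYS = "always"
    NEVER = "never"

def trigger_fires(config: FireWhen, role: SessionRole) -> bool:
    """Decision matrix per the description above: a trigger configured
    for the origin case also fires for purely local transactions."""
    if config is FireWhen.ALWAYS:
        return True
    if config is FireWhen.NEVER:
        return False
    if config is FireWhen.ORIGIN:
        return role in (SessionRole.ORIGIN, SessionRole.LOCAL)
    return role is SessionRole.REPLICA
```

The point of the matrix is that the session declares its role once, and every trigger and rule decision follows from that declaration, whether the cluster is master-slave or multi-master.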


Although my current plan isn't to create such a blended system, the 
proposed trigger and rule changes are designed to support exactly that 
in a 100% backward-compatible way.



Jan



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Richard Troy

On Fri, 9 Feb 2007, Andrew Dunstan wrote:
 Richard Troy wrote:
  In more specific terms, and I'm just brainstorming in public here, perhaps
  we can use the power of Schemas within a database to manage such
  divisions; commands which pertain to replication can/would include a
  schema specifier and elements within the schema can be replicated one way
  or another, at the whim of the DBA / Architect. For backwards
  compatability, if a schema isn't specified, it indicates that command
  pertains to the entire database.

 I understand that you're just thinking aloud, but overloading namespaces
 in this way strikes me as awful. Applications and extensions, which are
 the things that have need of namespaces, should not have to care about
 replication. If we have to design them for replication we'll be on a
 fast track to nowhere IMNSHO.

Well, Andrew, replication _is_ an application. Or, you could think of
replication as an extension to an application. I was under the impression
that _users_ decide to put tables in schema spaces based upon _user_ need,
and that Postgres developers' use of them for other purposes was
encroaching on user choices, not the other way around. Either way,
claiming "need" like this strikes me as stuck-in-a-rut or dogmatic
thinking. Besides, don't we have schema nesting to help resolve any such
care? And, what do you mean by "design them for replication"?

While I'm in no way stuck on blending replication strategies via schemas,
it does strike me as an appropriate concept and I'd prefer to have it
evaluated based on technical merit - possibly citing workarounds or
solutions to technical issues, which is what I gather has been the
tradition of this group: Use case first, technical merit second... Other
alternatives, ISTM, will have virtually the same look/feel as a schema
from an external perspective, and the more I think of it the more I think
using schemas is a sound, clean approach. That it offends someone's sense
of aesthetics STM a poor rationale for not choosing it. Another question
might be: What's lacking in the implementation of schemas that makes this
a poor choice, and what could be done about it without much effort?

Regards,
Richard



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Richard Troy

On Fri, 9 Feb 2007, Jan Wieck wrote:

 No matter how many different models you have in parallel, one single
 transaction will be either a master, a slave or an isolated local thing.
 The proposed changes allow to tell the session which of these three
 roles it is playing and the triggers and rules can be configured to fire
 during master/local role, slave role, always or never. That
 functionality will work for master-slave as well as multi-master.

 Although my current plan isn't creating such a blended system, the
 proposed trigger and rule changes are designed to support exactly that
 in a 100% backward compatible way.

 Jan

Fantastic! ...At some point you'll be thinking of the management end -
turning it on or off, etc. That might be where these other points come
more into play.

Richard



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/9/2007 2:19 PM, Andrew Hammond wrote:

On Feb 7, 8:12 pm, [EMAIL PROTECTED] (Bruce Momjian) wrote:

Jan Wieck wrote:
 On 2/7/2007 10:35 PM, Bruce Momjian wrote:
  I find the term logical proof of it's correctness too restrictive.  It
  sounds like some formal academic process that really doesn't work well
  for us.

 Thank you.


My intuition is that it might be possible to prove that _nothing_ can
provide guaranteed ordering when there is disconnected operation.


As a matter of physics, for two events happening outside of the event 
horizon of each other, the question of which happened first is pointless.



However, I think that the clock based ordering Jan has described could
provide _probable_ ordering under disconnected operation. I can see
three variables in the equation that would determine the probability
of correctness for the ordering.


That is precisely the intended functionality. And I can describe exactly 
when two conflicting actions will result in the wrong row persisting. 
This will happen when the second update to the logically same row is 
performed on the server whose Lamport timestamp is lagging behind by 
more than the time between the two conflicting commits. Example: User 
fills out a form, submits, hits the back button, corrects the input and 
submits again within 3 seconds. Load balancing sends both requests to 
different servers and the first server is 3.0001 seconds ahead ... the 
user's typo will be the winner.


My Lamport timestamp conflict resolution will not be able to solve this 
problem. However, when this happens, one thing is guaranteed. The update 
from the second server, arriving on the first for replication, will be 
ignored because a locally generated row is newer. This fact can be used 
as an indicator that there is a possible conflict that was resolved 
using the wrong data (business-process wise). All nodes in the cluster 
will end up using the same wrong row, so at least they are consistently 
wrong. Nevertheless, being able to identify possible problem cases this 
way will make it possible to initiate further action, including but not 
limited to human intervention.


If this is not an acceptable risk for the application, other resolution 
methods will be needed. But I think in many cases, this form of default 
resolution will be good enough.



Jan



Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Dunstan

Richard Troy wrote:

On Fri, 9 Feb 2007, Andrew Dunstan wrote:
  

Richard Troy wrote:


In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatability, if a schema isn't specified, it indicates that command
pertains to the entire database.
  

I understand that you're just thinking aloud, but overloading namespaces
in this way strikes me as awful. Applications and extensions, which are
the things that have need of namespaces, should not have to care about
replication. If we have to design them for replication we'll be on a
fast track to nowhere IMNSHO.



Well, Andrew, replication _is_ an application. Or, you could think of
replication as an extension to an application. 


No, I don't think of it as either. It's a utility, more an extension of 
the DBMS than of the application. You don't replicate for the sake of 
replicating.



I was under the impression
that_users_ decide to put tables in schema spaces based upon _user_ need,
and that Postgres developer's use of them for other purposes was
incroaching on user choices, not the other way around.


That's exactly what you would be doing with this proposal, encroaching 
on what I regard as user space.




 Either way,
claiming need  like this strikes me as stuck-in-a-rut or dogmatic
thinking. Besides, don't we have schema nesting to help resolve any such
care? 


No. We do now have schema nesting, for this or any other purpose. Where 
did you get that idea? If we did I would not be so resistant to using 
them for this purpose, but as it is, if you hijack schemas for 
replication segregation you will detract from their more obvious use in 
name segregation.


cheers

andrew





Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Andrew Dunstan

I wrote:

We do now have schema nesting, for this or any other purpose.


s/now/not/ (of course)

cheers

andrew




Re: [HACKERS] Proposal: Commit timestamp

2007-02-09 Thread Jan Wieck

On 2/9/2007 3:25 PM, Andrew Dunstan wrote:

Richard Troy wrote:

On Fri, 9 Feb 2007, Andrew Dunstan wrote:
  

Richard Troy wrote:


In more specific terms, and I'm just brainstorming in public here, perhaps
we can use the power of Schemas within a database to manage such
divisions; commands which pertain to replication can/would include a
schema specifier and elements within the schema can be replicated one way
or another, at the whim of the DBA / Architect. For backwards
compatability, if a schema isn't specified, it indicates that command
pertains to the entire database.
  

I understand that you're just thinking aloud, but overloading namespaces
in this way strikes me as awful. Applications and extensions, which are
the things that have need of namespaces, should not have to care about
replication. If we have to design them for replication we'll be on a
fast track to nowhere IMNSHO.



Well, Andrew, replication _is_ an application. Or, you could think of
replication as an extension to an application. 


No, I don't think of it as either. It's a utility, more an extension of 
the DBMS than of the application. You don't replicate for the sake of 
replicating.



I was under the impression
that_users_ decide to put tables in schema spaces based upon _user_ need,
and that Postgres developer's use of them for other purposes was
incroaching on user choices, not the other way around.


That's exactly what you would be doing with this proposal, encroaching 
on what I regard as user space.


I'd never use a schema for that. Look at Slony-I. It uses a 
user-selectable schema for the stuff it needs to store in the database, 
to keep Slony objects separate from user objects. But it organizes the 
user tables in what are called sets. A set can be any combination of 
tables and sequences from any number of namespaces. If I were to do it 
again, I would use names instead of numbers to identify sets, and I 
would allow for summary sets containing groups of simple sets.


On the other issue, replication is part of the overall infrastructure 
and thereby part of the solution to a business problem. Like the 
hardware, database and application, it solves a piece of the puzzle that 
none of the others is particularly good at. It will perform better or 
eventually not at all, depending on how much consideration the 
distributed nature of the business model was given when the application 
was designed. Applications that happened by accident rather than being 
designed usually don't allow any kind of distribution.


And yes, there are people who replicate for the sake of it. They are the 
kind of people who ask for sync multi-master no matter what their actual 
problem might be, because it sounds most prestigious.



Jan



Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Zeugswetter Andreas ADI SD

 Yes, yes, and yes ... but aside from the problem that you use the very
 ambiguous word timestamp (which somehow suggests using a clock of
 some sort), isn't the begin timestamp of a long running transaction

imho a begin timestamp is near useless

 worse than the commit timestamp, when all its work got visible to the
 outside world instantaneously?

This is one of the areas I am still worried about. Is one commit Lamport
timestamp enough?
I think for some conflict resolutions we need to look at the
row level, and resolve conflicts per row and not per transaction
(yes, this means that a tx might get partially replicated).

What I am trying to lead at is: maybe an infrastructure to produce
Wieck Lamport timestamps, that can be used in different places like
commit hooks and column defaults, would be of more general use. Maybe
such a column could be a system column that is not visible with select *
for those cases where commit is not enough. And a commit hook could
insert it into clog-like storage.

Andreas



Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Jan Wieck

On 2/7/2007 11:12 PM, Bruce Momjian wrote:

Jan Wieck wrote:

On 2/7/2007 10:35 PM, Bruce Momjian wrote:
 I find the term logical proof of it's correctness too restrictive.  It
 sounds like some formal academic process that really doesn't work well
 for us.

Thank you.

 Also, I saw the trigger patch with no explanation of why it was
 important or who would use it --- that also isn't going to fly well.

You didn't respond to my explanation of how the current Slony 
implementation could improve and evolve using it. Are you missing 
something? I am discussing this very issue with our own QA department, 
and thus far, I think I have a majority of "would use a pg_trigger 
backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly 
how it corrupted my system catalog."


No, I _now_ understand the use case, but when the patch was posted, the
use case was missing.  I would like to see a repost with the patch, and
a description of its use so we can all move forward on that.


Is this a new policy that after discussion, all patches must be 
resubmitted with a summary and conclusions of the discussion? I can 
certainly do that for you, but just tell me if you are going to ask the 
same from everyone.



Jan



Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Bruce Momjian
Jan Wieck wrote:
 On 2/7/2007 11:12 PM, Bruce Momjian wrote:
  Jan Wieck wrote:
  On 2/7/2007 10:35 PM, Bruce Momjian wrote:
   I find the term logical proof of it's correctness too restrictive.  It
   sounds like some formal academic process that really doesn't work well
   for us.
  
  Thank you.
  
   Also, I saw the trigger patch with no explaination of why it was
   important or who would use it --- that also isn't going to fly well.
  
  You didn't respond to my explanation how the current Slony 
  implementation could improve and evolve using it. Are you missing 
  something? I am discussing this very issue with our own QA department, 
  and thus far, I think I have a majority of would use a pg_trigger 
  backpatched PostgreSQL vs. No, I prefer a system that knows exactly 
  how it corrupted my system catalog.
  
  No, I _now_ understand the use case, but when the patch was posted, the
  use case was missing.  I would like to see a repost with the patch, and
  a description of its use so we can all move forward on that.
 
 Is this a new policy that after discussion, all patches must be 
 resubmitted with a summary and conclusions of the discussion? I can 
 certainly do that for you, but just tell me if you are going to ask the 
 same from everyone.

No, I am asking only this time because I feel there was too much
disconnect between the patch and the extensive replication discussion
that few community members would see the connection.

I would also like to know what your new feature does for each supported
option.  I have not seen that spelled out yet at all.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Alvaro Herrera
Bruce Momjian wrote:
 Jan Wieck wrote:
  On 2/7/2007 11:12 PM, Bruce Momjian wrote:
   Jan Wieck wrote:
   On 2/7/2007 10:35 PM, Bruce Momjian wrote:
  
Also, I saw the trigger patch with no explaination of why it was
important or who would use it --- that also isn't going to fly well.
   
   You didn't respond to my explanation how the current Slony 
   implementation could improve and evolve using it. Are you missing 
   something? I am discussing this very issue with our own QA department, 
   and thus far, I think I have a majority of would use a pg_trigger 
   backpatched PostgreSQL vs. No, I prefer a system that knows exactly 
   how it corrupted my system catalog.
   
   No, I _now_ understand the use case, but when the patch was posted, the
   use case was missing.  I would like to see a repost with the patch, and
   a description of its use so we can all move forward on that.
  
  Is this a new policy that after discussion, all patches must be 
  resubmitted with a summary and conclusions of the discussion? I can 
  certainly do that for you, but just tell me if you are going to ask the 
  same from everyone.
 
 No, I am asking only this time because I feel there was too much
 disconnect between the patch and the extensive replication discussion
 that few community members would see the connection.

FYI, in my opinion the trigger addition is clearly useful to Mammoth
Replicator as well.  In fact, it's so obviously useful that I didn't see
a need to state that in the original thread where it was discussed.

Not sure about the timestamp stuff, because Replicator is not
multi-master, so there's no conflict resolution to take care of.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Bruce Momjian
Alvaro Herrera wrote:
   Is this a new policy that after discussion, all patches must be 
   resubmitted with a summary and conclusions of the discussion? I can 
   certainly do that for you, but just tell me if you are going to ask the 
   same from everyone.
  
  No, I am asking only this time because I feel there was too much
  disconnect between the patch and the extensive replication discussion
  that few community members would see the connection.
 
 FYI, in my opinion the trigger addition is clearly useful to Mammoth
 Replicator as well.  In fact, it's so obviously useful that I didn't see
 a need to state that in the original thread where it was discussed.

Right, I know it is useful too, but I would like a layout of what it
does and why so everyone is clear on it.




Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Jan Wieck

On 2/8/2007 3:32 PM, Bruce Momjian wrote:

Alvaro Herrera wrote:
  Is this a new policy that after discussion, all patches must be 
  resubmitted with a summary and conclusions of the discussion? I can 
  certainly do that for you, but just tell me if you are going to ask the 
  same from everyone.
 
 No, I am asking only this time because I feel there was too much

 disconnect between the patch and the extensive replication discussion
 that few community members would see the connection.

FYI, in my opinion the trigger addition is clearly useful to Mammoth
Replicator as well.  In fact, it's so obviously useful that I didn't see
a need to state that in the original thread where it was discussed.


Right, I know it is useful too, but I would like a layout of what it
does and why so everyone is clear on it.


I have no clue what got you into what you are doing here. But that shall 
not be my real concern. If you feel the need to do this sort of thing, 
be my guest. I will add the remaining changes to pg_rewrite, including 
the new support commands and changes to psql as well as pg_dump and 
resubmit the new patch with explanations that will hopefully help you to 
comprehend what and how this relatively small and fully backward 
compatible change in the trigger and rule firing mechanism will work and 
what existing problems it will solve.



Regards,
Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Bruce Momjian
Jan Wieck wrote:
 On 2/8/2007 3:32 PM, Bruce Momjian wrote:
  Alvaro Herrera wrote:
Is this a new policy that after discussion, all patches must be 
resubmitted with a summary and conclusions of the discussion? I can 
certainly do that for you, but just tell me if you are going to ask 
the 
same from everyone.
   
   No, I am asking only this time because I feel there was too much
   disconnect between the patch and the extensive replication discussion
   that few community members would see the connection.
  
  FYI, in my opinion the trigger addition is clearly useful to Mammoth
  Replicator as well.  In fact, it's so obviously useful that I didn't see
  a need to state that in the original thread where it was discussed.
  
  Right, I know it is useful too, but I would like a layout of what it
  does and why so everyone is clear on it.
 
 I have no clue what got you into what you are doing here. But that shall 
 not be my real concern. If you feel the need to do this sort of thing, 
 be my guest. I will add the remaining changes to pg_rewrite, including 
 the new support commands and changes to psql as well as pg_dump and 
 resubmit the new patch with explanations that will hopefully help you to 
 comprehend what and how this relatively small and fully backward 
 compatible change in the trigger and rule firing mechanism will work and 
 what existing problems it will solve.

Yep, that's what I want everyone to see.  This is standard procedure for
everyone in the community, and core is not immune.




Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Joshua D. Drake
Jan Wieck wrote:
 On 2/8/2007 3:32 PM, Bruce Momjian wrote:
 Alvaro Herrera wrote:
   Is this a new policy that after discussion, all patches must be 
  resubmitted with a summary and conclusions of the discussion? I can
   certainly do that for you, but just tell me if you are going to
 ask the   same from everyone.
   No, I am asking only this time because I feel there was too much
  disconnect between the patch and the extensive replication discussion
  that few community members would see the connection.

 FYI, in my opinion the trigger addition is clearly useful to Mammoth
 Replicator as well.  In fact, it's so obviously useful that I didn't see
 a need to state that in the original thread where it was discussed.

 Right, I know it is useful too, but I would like a layout of what it
 does and why so everyone is clear on it.

Well how deep are we talking here? My understanding of what Jan wants to
do is simple.

Be able to declare which triggers are fired depending on the state of
the cluster.

In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
or Slave.

This is useful because I may have a trigger on the Master and the same
trigger on the Slave. You do not want the trigger to fire on the Slave
because we are doing data replication. In short, we replicate the
result, not the action.

However, you may want triggers that are on the Slave to fire separately.
A reporting server that generates materialized views is a good example.
Don't tie up the Master with what a Slave can do.
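As a toy illustration of that idea (the names and the role mechanism here are invented, not the proposed pg_trigger syntax), trigger firing could be made to depend on a per-session replication role:

```python
from enum import Enum

class Role(Enum):
    ORIGIN = "origin"    # "Master" in Replicator terms
    REPLICA = "replica"  # "Slave" in Replicator terms

class Trigger:
    def __init__(self, name, fire_on):
        self.name = name
        self.fire_on = fire_on  # set of roles this trigger fires for

def fire_triggers(triggers, session_role):
    """Return the names of triggers that should run for a row change."""
    return [t.name for t in triggers if session_role in t.fire_on]

audit = Trigger("audit_log", {Role.ORIGIN})          # origin only
report = Trigger("refresh_report", {Role.REPLICA})   # replica only

# On the origin only the audit trigger runs; on the replica only the
# reporting trigger runs, because the replicated *result* is applied
# directly and the original action must not be re-executed.
assert fire_triggers([audit, report], Role.ORIGIN) == ["audit_log"]
assert fire_triggers([audit, report], Role.REPLICA) == ["refresh_report"]
```

A backend-level equivalent would simply tag each trigger with the set of session states in which it fires.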

Sincerely,

Joshua D. Drake



 
 I have no clue what got you into what you are doing here. But that shall
 not be my real concern. If you feel the need to do this sort of thing,
 be my guest. I will add the remaining changes to pg_rewrite, including
 the new support commands and changes to psql as well as pg_dump and
 resubmit the new patch with explanations that will hopefully help you to
 comprehend what and how this relatively small and fully backward
 compatible change in the trigger and rule firing mechanism will work and
 what existing problems it will solve.
 
 
 Regards,
 Jan
 


-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Bruce Momjian

I just want an outline of what each option is supposed to control.  If
that information is in a documentation patch, then fine, he can just
post that and tell people to read the patch documentation.

---

Joshua D. Drake wrote:
 Jan Wieck wrote:
  On 2/8/2007 3:32 PM, Bruce Momjian wrote:
  Alvaro Herrera wrote:
Is this a new policy that after discussion, all patches must be 
   resubmitted with a summary and conclusions of the discussion? I can
certainly do that for you, but just tell me if you are going to
  ask the   same from everyone.
No, I am asking only this time because I feel there was too much
   disconnect between the patch and the extensive replication discussion
   that few community members would see the connection.
 
  FYI, in my opinion the trigger addition is clearly useful to Mammoth
  Replicator as well.  In fact, it's so obviously useful that I didn't see
  a need to state that in the original thread where it was discussed.
 
  Right, I know it is useful too, but I would like a layout of what it
  does and why so everyone is clear on it.
 
 Well how deep are we talking here? My understanding of what Jan wants to
 do is simple.
 
 Be able to declare which triggers are fired depending on the state of
 the cluster.
 
 In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
 or Slave.
 
 This is useful because I may have a trigger on the Master and the same
 trigger on the Slave. You do not want the trigger to fire on the Slave
 because we are doing data replication. In short, we replicate the
 result, not the action.
 
 However, you may want triggers that are on the Slave to fire separately.
 A reporting server that generates materialized views is a good example.
 Don't tie up the Master with what a Slave can do.
 
 Sincerely,
 
 Joshua D. Drake
 
 
 
  
  I have no clue what got you into what you are doing here. But that shall
  not be my real concern. If you feel the need to do this sort of thing,
  be my guest. I will add the remaining changes to pg_rewrite, including
  the new support commands and changes to psql as well as pg_dump and
  resubmit the new patch with explanations that will hopefully help you to
  comprehend what and how this relatively small and fully backward
  compatible change in the trigger and rule firing mechanism will work and
  what existing problems it will solve.
  
  
  Regards,
  Jan
  
 
 
 
 




Re: [HACKERS] Proposal: Commit timestamp

2007-02-08 Thread Richard Troy

On Thu, 8 Feb 2007, Joshua D. Drake wrote:

 Well how deep are we talking here? My understanding of what Jan wants to
 do is simple.

 Be able to declare which triggers are fired depending on the state of
 the cluster.

 In Jan's terms, the Origin or Subscriber. In Replicator terms the Master
 or Slave.

 This is useful because I may have a trigger on the Master and the same
 trigger on the Slave. You do not want the trigger to fire on the Slave
 because we are doing data replication. In short, we replicate the
 result, not the action.

 However, you may want triggers that are on the Slave to fire separately.
 A reporting server that generates materialized views is a good example.
 Don't tie up the Master with what a Slave can do.


It'd be great if Jan considers the blending of replication; any given DB
instance shouldn't be only a master/originator or only a slave/subscriber.
A solution that lets you blend replication strategies in a single db is,
from my point of view, very important.

  I have no clue what got you into what you are doing here.

Jan, some sleep now and then might be helpful to your public disposition.
-smile-

peace,
Richard

-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/




Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread José Orlando Pereira
On Saturday 03 February 2007, Bruce Momjian wrote:
 Jan Wieck wrote:
  I don't have any such paper and the proof of concept will be the
  implementation of the system. I do however see enough resistance against
  this proposal to withdraw the commit timestamp at this time. The new
  replication system will therefore require the installation of a patched,
  non-standard PostgreSQL version, compiled from sources cluster wide in
  order to be used. I am aware that this will dramatically reduce its
  popularity but it is impossible to develop this essential feature as an
  external module.
 
  I thank everyone for their attention.

 Going and working on it on your own doesn't seem like the proper
 solution.  I don't see people objecting to adding it, but they want it
 work, which I am sure you want too.  You have to show how it will work
 and convince others of that, and then you have a higher chance it will
 work, and be in the PostgreSQL codebase.

Hi,

Would it be possible to solve the problem using the GORDA on-commit hook?

Jan would be able to reliably obtain a commit timestamp with the desired 
semantics and store it in a regular table within transaction boundaries.

PostgreSQL would not have to commit to any specific timestamp semantics, and 
the patch is quite small.
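Assuming a hook with roughly these properties (the names below are invented for illustration, not the actual GORDA interface), a minimal sketch of deriving a monotonic, commit-ordered timestamp inside such a hook:

```python
import threading

class CommitStamper:
    """Hypothetical commit hook: called serially, in commit order,
    so no race between concurrent committers is possible."""
    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def on_commit(self, wall_clock_ms, commit_log):
        with self._lock:
            # advance monotonically even if the wall clock steps backwards
            self._last = max(self._last + 1, wall_clock_ms)
            commit_log.append(self._last)  # stands in for a regular table
            return self._last

stamper, log = CommitStamper(), []
assert stamper.on_commit(1000, log) == 1000
assert stamper.on_commit(999, log) == 1001  # clock regressed; stamp still advances
assert log == [1000, 1001]
```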

Regards,

-- 
Jose Orlando Pereira



Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Jan Wieck

On 2/7/2007 2:37 AM, Markus Schiltknecht wrote:

Hi,

Jan Wieck wrote:
Whatever strategy one will use, in an async multimaster there are always 
cases that can be resolved by rules (last update being one of them), and 
some that I can't even imagine solving so far. I guess some of the cases 
will simply boil down to the application has to make sure that ... 
never occurs. Think of a multi-item order, created on one node, while 
another node is deleting the long unused item (which would have to be 
backordered). Now while those two nodes figure out what to do to make 
this consistent again, a third node does a partial shipment of that 
order.


It helps to categorize these conflict types. There basically are:


Are we still discussing if the Postgres backend may provide support for
a commit timestamp, that follows the rules for Lamport timestamps in a
multi-node cluster? It seems more like we are drifting into what type of
replication system I should design to please most people.


Jan




* data conflicts: simple row data, i.e. update - update conflicts.

* uniqueness conflicts: two rows conflict because they'd violate a 
uniqueness constraint, i.e. insert - insert, update - insert or update - 
update.


* visibility conflicts: basically the remaining update - delete and 
delete - delete cases. But also SELECT FOR UPDATE candidates, etc... 
Everything having to do with a row not yet or no longer being visible 
to a transaction.
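Purely as an illustration, these three categories might be expressed as a toy classifier (invented names, no relation to any proposed patch):

```python
from enum import Enum, auto

class Conflict(Enum):
    DATA = auto()        # update - update on the same row's data
    UNIQUENESS = auto()  # distinct rows colliding on a unique key
    VISIBILITY = auto()  # update - delete, delete - delete, FOR UPDATE

def classify(local_op, remote_op, same_row, same_key):
    """Toy classification of two concurrent operations."""
    ops = {local_op, remote_op}
    if "delete" in ops and same_row:
        return Conflict.VISIBILITY   # row gone (or going) for one side
    if same_key and not same_row:
        return Conflict.UNIQUENESS   # two rows, one unique key value
    if ops == {"update"} and same_row:
        return Conflict.DATA         # plain update - update
    return None                      # no conflict

assert classify("update", "update", True, True) is Conflict.DATA
assert classify("insert", "insert", False, True) is Conflict.UNIQUENESS
assert classify("update", "delete", True, True) is Conflict.VISIBILITY
```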


Your example certainly involves a visibility conflict (update - delete). 
Not even (sync) Postgres-R can guarantee consistency on the visibility 
level, i.e.  a first transaction's SELECT FOR UPDATE might not see some 
just recently committed transaction's newly inserted rows (because that 
one isn't replayed yet on the node, thus the transaction is working on 
an 'old' snapshot of the database state). Another simpler example: 
Postgres-R doesn't raise a serialization error on delete-delete 
conflicts, it simply deletes the row once, even if two transactions 
confirmed to have committed a transaction which deleted a row.


Luckily, most applications don't need that anyway, though.

The solution is simple, reinsert the deleted item ... 


..at which point timestamps certainly won't help :-)   Sorry, couldn't 
resist...


only that 
there were rather nasty ON DELETE CASCADE's on that item that removed 
all the consumer reviews, product descriptions, data sheets and what 
not. It's going to be an awful lot of undo.


Huh? Are you planning on aborting *parts* of a transaction? I didn't 
think about that, but my gut feeling is that you don't want to do that.


I haven't really made up my mind about a user defined rule based 
conflict resolution interface yet. I do plan to have a unique and 
foreign key constraint based, synchronous advisory locking system on top 
of my system in a later version (advisory key locks would stay in place 
until the transaction, that placed them, replicates).


You'd have to elaborate on that...

I guess you see by now why I wanted to keep the discussion about the 
individual, rather generic support features in the backend separate from 
the particular features I plan to implement in the replication system.


Sure. I know, discussions about replication can get endless, probably 
even are so by definition ;-)  But hey, they're fun!


Everyone has different needs and consequently an async multi-master 
must do a whole range of mutually exclusive things altogether ... 
because Postgres can never accept a partial solution. We want the 
egg-laying milk-wool-pig or nothing.


Like the one which would result from a merge of such an async 
replication with a sync one? Imagine being able to choose between sync 
and async per transaction...


Regards

Markus









Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Markus Schiltknecht

Hi,

Jan Wieck wrote:

Are we still discussing if the Postgres backend may provide support for
a commit timestamp, that follows the rules for Lamport timestamps in a
multi-node cluster?


No. And I think you know my opinion about that by now. ;-)


It seems more like we are drifting into what type of
replication system I should design to please most people.


Nobody is telling you what you should do. You're free to do whatever you 
want to.


I'm only trying to get a discussion going, because a) I'm interested in 
how you plan to solve these problems and b) in the past, most people 
were complaining that all the different replication efforts didn't try 
to work together. I'm slowly trying to open up and discuss what I'm 
doing with Postgres-R on the lists.


Just yesterday at the SFPUG meeting, I've experienced how confusing it 
is for the users to have such a broad variety of (existing and upcoming) 
replication solutions. And I'm all for working together and probably 
even for merging different replication solutions.


Regards

Markus




Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Richard Troy

 Jan Wieck wrote:
  Are we still discussing if the Postgres backend may provide support for
  a commit timestamp, that follows the rules for Lamport timestamps in a
  multi-node cluster?

...I thought you said in this thread that you hadn't and weren't going to
work on any kind of logical proof of its correctness, saw no value in
prototyping your way to a clear (convincing) argument, and were
withdrawing the proposal due to all the issues others raised which were,
in light of this, unanswerable beyond conjecture. I thought that the
thread was continuing because other people saw value in the kernel of the
idea, would support if if it could be shown to be correct/useful, were
disappointed you'd leave it at that and wanted to continue to see if
something positive might come of the dialogue. So, the thread weaved
around a bit. I think that if you want to nail this down, people here are
willing to be convinced, but that hasn't happened yet.

On Wed, 7 Feb 2007, Markus Schiltknecht wrote:
 I'm only trying to get a discussion going, because a) I'm interested in
 how you plan to solve these problems and b) in the past, most people
 were complaining that all the different replication efforts didn't try
 to work together. I'm slowly trying to open up and discuss what I'm
 doing with Postgres-R on the lists.

 Just yesterday at the SFPUG meeting, I've experienced how confusing it
 is for the users to have such a broad variety of (existing and upcoming)
 replication solutions. And I'm all for working together and probably
 even for merging different replication solutions.

In support of that idea, I offer this; When Randy Eash wrote the world's
first replication system for Ingres circa 1990, his work included ideas
and features that are right now in the Postgres world fragmented among
several existing replication / replication-related products, along with
some things that are only now in discussion in this group. As discussed at
the SFPUG meeting last night, real-world use cases are seldom if ever
completely satisfied with a one-size-fits-all replication strategy. For
example, a manufacturing company might want all factories to be capable of
being autonomous but both report activities and take direction from
corporate headquarters. To do this without having multiple databases at
each site, a single database instance would likely be both a master and
slave, but for differing aspects of the businesses needs. Business
decisions would resolve the conflicts, say, the manufacturing node always
wins when it comes to data that pertains to its own work, rather than
something like a timestamp-based last-update-wins rule.

Like Markus, I would like to see the various replication efforts merged as
best they can be because even if the majority of users don't use a little
bit of everything, surely the more interesting cases would like to and the
entire community is better served if the various solutions are in
harmony.

Richard






Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Jan Wieck

On 2/7/2007 12:54 PM, Markus Schiltknecht wrote:

Hi,

Jan Wieck wrote:

Are we still discussing if the Postgres backend may provide support for
a commit timestamp, that follows the rules for Lamport timestamps in a
multi-node cluster?


No. And I think you know my opinion about that by now. ;-)


Then let me give you a little puzzle just for the fun of it.

A database containing customer contact information (among other things) 
is a two node multimaster system. One is serving the customer web 
portal, the other is used by the company staff including the call 
center. At 13:45 the two servers lose connectivity to each other, yet 
the internal staff can access the internal server while the web portal 
is accessible from the outside. At 13:50 customer A updates their credit 
card information through the web portal, while customer B does the same 
through the call center. At 13:55 both customers change their mind to 
use yet another credit card, now customer A phones the call center while 
customer B does it via the internet.


At 14:00 the two servers reconnect and go through the conflict 
resolution. How do you intend to solve both conflicts without using any 
"clock", since that seems to be a stopword causing instant rejection 
of whatever you propose? Needless to say, both customers will be 
dissatisfied if you charge the wrong credit card during your next 
billing cycle.
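For illustration only, a toy model of last-update-wins resolution driven by a Lamport-style commit stamp, i.e. a (clock, node) pair that advances on every local and every replicated commit (this is a sketch of the idea, not the proposed backend mechanism):

```python
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = 0
        self.rows = {}   # key -> (value, stamp)

    def commit(self, key, value, wall_clock):
        # local commit: the stamp moves forward even if wall time stalls
        self.clock = max(self.clock + 1, wall_clock)
        self.rows[key] = (value, (self.clock, self.node_id))

    def receive(self, key, value, stamp):
        # reconnect/conflict resolution: last (Lamport) update wins,
        # ties broken deterministically by node id
        self.clock = max(self.clock + 1, stamp[0])
        if key not in self.rows or stamp > self.rows[key][1]:
            self.rows[key] = (value, stamp)

web, internal = Node(1), Node(2)
# 13:50 -- first card change, each on the only reachable node
web.commit("cust_A", "card-1", 1350)
internal.commit("cust_B", "card-1", 1350)
# 13:55 -- both customers change their mind via the *other* channel
internal.commit("cust_A", "card-2", 1355)
web.commit("cust_B", "card-2", 1355)
# 14:00 -- reconnect: exchange rows and resolve
for key in ("cust_A", "cust_B"):
    web.receive(key, *internal.rows[key])
    internal.receive(key, *web.rows[key])

# both nodes converge on each customer's 13:55 decision
assert web.rows["cust_A"][0] == internal.rows["cust_A"][0] == "card-2"
assert web.rows["cust_B"][0] == internal.rows["cust_B"][0] == "card-2"
```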






It seems more like we are drifting into what type of
replication system I should design to please most people.


Nobody is telling you what you should do. You're free to do whatever you 
want to.


I'm only trying to get a discussion going, because a) I'm interested in 
how you plan to solve these problems and b) in the past, most people 
were complaining that all the different replication efforts didn't try 
to work together. I'm slowly trying to open up and discuss what I'm 
doing with Postgres-R on the lists.


Which is a good discussion because one of the reasons why I stopped 
looking into Postgres-R is the fact that it is based on the idea to push 
all the replication information through a system that generates a global 
serialized message queue. That by itself isn't the problem, but the fact 
that implementing a global serialized message queue has serious 
throughput issues that are (among other details) linked to the speed of 
light.


I am trying to start with a system that doesn't rely on such a 
mechanism for everything. I do intend to add an option later, that 
allows to declare a UNIQUE NOT NULL constraint to be synchronous. What 
that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE 
will require the node to currently be a member of the (quorum or 
priority defined) majority of the cluster. An advisory lock system, 
based on a total order group communication, will grant the lock to the 
unique key values on a first come, first serve base. Every node in the 
cluster will keep those keys as locked until the asynchronous 
replication stream reports the locking transaction as ended. If another 
remote transaction in the meantime requires updating such key, the 
incoming stream from that node will be on hold until the lock is 
cleared. This is to protect against node B replicating a transaction from 
node A and a later update on node B arrives on C before C got the first 
event from A. A node that got disconnected from the cluster must rebuild 
the current advisory lock list upon reconnecting to the cluster.
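A toy rendering of that advisory lock list (group communication, quorum membership and recovery are abstracted away; all names here are invented):

```python
from collections import deque

class AdvisoryLocks:
    """Sketch: advisory locks on unique key values, granted in the total
    order of the group communication (first come, first serve), released
    only when the async replication stream reports the txn as ended."""
    def __init__(self):
        self.holder = {}   # key -> txn id holding the lock
        self.waiting = {}  # key -> queue of held-back remote events

    def grant(self, key, txn):
        # lock requests are delivered in total order
        if key in self.holder:
            return False           # contender must wait
        self.holder[key] = txn
        return True

    def apply_remote(self, key, txn, event):
        # a remote change to a locked key is held until the lock clears,
        # protecting against B's update reaching C before A's event does
        if key in self.holder and self.holder[key] != txn:
            self.waiting.setdefault(key, deque()).append(event)
            return None
        return event               # safe to apply immediately

    def txn_ended(self, key):
        # replication stream reported the locking txn as ended
        self.holder.pop(key, None)
        return list(self.waiting.pop(key, deque()))  # now applicable

locks = AdvisoryLocks()
assert locks.grant("sku-42", txn=1) is True
assert locks.grant("sku-42", txn=2) is False            # already locked
assert locks.apply_remote("sku-42", 2, "update") is None  # held back
assert locks.txn_ended("sku-42") == ["update"]          # released
```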


I think that this will be a way to overcome Postgres-R's communication 
bottleneck, as well as allowing limited update activity even during a 
completely disconnected state of a node. Synchronous or group 
communication messages are reduced to the cases, where the application 
cannot be implemented in a conflict free way, like allocating a natural 
primary key. There is absolutely no need to synchronize for example 
creating a sales order. An application can use globally unique IDs for 
the order number. And everything possibly referenced by an order (items, 
customers, ...) is stored in a way that the references are never 
updated. Deletes to those possibly referenced objects are implemented in 
a two step process, where they are first marked obsolete, and later on 
things that have been marked obsolete for X long are deleted. A REPLICA 
TRIGGER on inserting an order will simply reset the obsolete flag of 
referenced objects. If a node is disconnected longer than X, you have a 
problem - hunt down the guy who defined X.
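The two-step delete can be sketched like this (illustrative Python standing in for the REPLICA TRIGGER plus a background purge job):

```python
class Catalog:
    """Sketch of the two-step delete: referenced objects are first
    marked obsolete and physically removed only after X; a replica-side
    trigger on order insert simply clears the flag again."""
    def __init__(self, grace_period):
        self.items = {}        # item -> obsolete-since (None = live)
        self.grace = grace_period

    def mark_obsolete(self, item, now):
        self.items[item] = now

    def replica_order_insert(self, item):
        # REPLICA TRIGGER: an incoming order revives the referenced item
        if item in self.items:
            self.items[item] = None

    def purge(self, now):
        # delete only what has been obsolete for longer than X
        for item, since in list(self.items.items()):
            if since is not None and now - since > self.grace:
                del self.items[item]

cat = Catalog(grace_period=10)
cat.items["widget"] = None              # live item
cat.mark_obsolete("widget", now=100)    # step 1: marked, not deleted
cat.replica_order_insert("widget")      # order replicated within X: revived
cat.purge(now=200)
assert "widget" in cat.items            # still live, never purged
```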


Just yesterday at the SFPUG meeting, I've experienced how confusing it 
is for the users to have such a broad variety of (existing and upcoming) 
replication solutions. And I'm all for working together and probably 
even for merging different replication solutions.


Merging certain ideas to come up with an async/sync hybrid? Seems to me 
we have similar enough ideas to need conflict resolution, because we had 
them simultaneously but communicated them asynchronously.



Jan

--

Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Markus Schiltknecht

Hi,

Jan Wieck wrote:

Then let me give you a little puzzle just for the fun of it.

A database containing customer contact information (among other things) 
is a two node multimaster system. One is serving the customer web 
portal, the other is used by the company staff including the call 
center. At 13:45 the two servers lose connectivity to each other, yet 
the internal staff can access the internal server while the web portal 
is accessible from the outside. At 13:50 customer A updates their credit 
card information through the web portal, while customer B does the same 
through the call center. At 13:55 both customers change their mind to 
use yet another credit card, now customer A phones the call center while 
customer B does it via the internet.


Phew, a mind twister... one customer would already be enough to trigger 
that sort of conflict...


At 14:00 the two servers reconnect and go through the conflict 
resolution. How do you intend to solve both conflicts without using any 
clock, because that seems to be a stopword causing instant rejection 
of whatever you propose. Needless to say, both customers will be 
dissatisfied if you charge the wrong credit card during your next 
billing cycle.


Correct. But do these cases justify storing timestamps for each and every 
transaction you do? That's what I doubt, not the usefulness of time-based 
conflict resolution for certain cases.


You can always add a time based conflict resolution, by adding a 
timestamp column and decide upon that one. I'd guess that the overall 
costs are lower that way.


But you've withdrawn that proposal already, so...

Which is a good discussion because one of the reasons why I stopped 
looking into Postgres-R is the fact that it is based on the idea to push 
all the replication information through a system that generates a global 
serialized message queue. That by itself isn't the problem, but the fact 
that implementing a global serialized message queue has serious 
throughput issues that are (among other details) linked to the speed of 
light.


Agreed. Nevertheless, there are use cases for such systems, because they 
put less limitations to the application. One could even argue, that your 
above example would be one ;-)


I am trying to start with a system that doesn't rely on such a 
mechanism for everything. I do intend to add an option later, that 
allows to declare a UNIQUE NOT NULL constraint to be synchronous. What 
that means is, that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE 
will require the node to currently be a member of the (quorum or 
priority defined) majority of the cluster.


Sounds reasonable.

An advisory lock system, 
based on a total order group communication, will grant the lock to the 
unique key values on a first come, first serve base. Every node in the 
cluster will keep those keys as locked until the asynchronous 
replication stream reports the locking transaction as ended. If another 
remote transaction in the meantime requires updating such key, the 
incoming stream from that node will be on hold until the lock is 
cleared. This is to protect against node B replicating a transaction from 
node A and a later update on node B arrives on C before C got the first 
event from A. A node that got disconnected from the cluster must rebuild 
the current advisory lock list upon reconnecting to the cluster.


Yeah, this is a convenient way to replicate sequences via a GCS.

I think that this will be a way to overcome Postgres-R's communication 
bottleneck, as well as allowing limited update activity even during a 
completely disconnected state of a node. Synchronous or group 
communication messages are reduced to the cases where the application 
cannot be implemented in a conflict-free way, like allocating a natural 
primary key. There is absolutely no need to synchronize, for example, 
creating a sales order. 


Agreed, such cases can easily be optimized. But you have to be aware of 
the limitations these optimizations cause. Postgres-R is much more 
targeted at very general use cases.


An application can use globally unique IDs for 
the order number. And everything possibly referenced by an order (items, 
customers, ...) is stored in a way that the references are never 
updated. Deletes of those possibly referenced objects are implemented in 
a two-step process, where they are first marked obsolete, and later on 
things that have been marked obsolete for X long are deleted. A REPLICA 
TRIGGER on inserting an order will simply reset the obsolete flag of 
referenced objects. If a node is disconnected longer than X, you have a 
problem - hunt down the guy who defined X.


Yeah, that's another very nice optimization. Again, as long as you know 
the limitations, that's all well and fine.


Merging certain ideas to come up with an async/sync hybrid? Seems to me 
we have similar enough ideas to need conflict resolution, because we had 
them simultaneously but communicated them asynchronously.


Huh? 

Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Jan Wieck

On 2/7/2007 2:15 PM, Richard Troy wrote:

Jan Wieck wrote:
 Are we still discussing if the Postgres backend may provide support for
 a commit timestamp, that follows the rules for Lamport timestamps in a
 multi-node cluster?


...I thought you said in this thread that you haven't and weren't going to
work on any kind of logical proof of its correctness, saw no value in
prototyping your way to a clear (convincing) argument, and were
withdrawing the proposal [...]


I said I don't have any such documents. I was asked to continue this 
discussion in order to find people willing to help discover potential 
problems. I am prepared to continue this development isolated, although 
I wouldn't like to.


The PostgreSQL developers community used to be good at throwing out 
ideas, brainstorming about the possibilities, adding more to them and 
coming up with very unique and flexible solutions. I am a little 
disappointed that much of that got lost over the years, and please 
forgive me if I sound a little grumpy over that. The statement to 
withdraw the proposal was certainly premature - consider it not 
withdrawn at this time. However, comparing what used to be our process 
to what I see today, I must say that something like TOAST would never 
have happened. It was the result of a global brainstorming that I 
simply translated into C code. Many details and features of the 
implementation are purely mine, but the really big sparks that got it 
to what it is, I'm not claiming for myself. Most importantly, "give me 
proof of concept before we can talk about changing backend code" was not 
part of the process at all. We were pretty eager to change things back 
then, when we needed to get better in almost every way possible ... are 
we so good at replication that we need to be conservative in that 
respect now? We are certainly good at some things and have to be 
conservative with respect to them, but replication in my not so very 
humble opinion isn't one of them.


I do understand that we have a codebase used in production these days. 
And because of that we have to maintain code and syntax stability to a 
degree we didn't have back in the glory days of introducing EXCEPT and 
INTERSECT (whose first incarnation was committed to the code base while 
completely destroying my entire work of fixing the rewriter). Maybe we 
need to introduce something entirely different, like the concept of an 
experimental feature. Something that we add to the code but that is 
explicitly flagged as not final, not stable, not guaranteed to stay or 
work in this or any other form. This requires that the feature has very 
limited interference with other parts of the system, like (or especially 
like) the query parser. If it turns out to be a problem in x.y.0, it 
will be backed out and gone in x.y.1. Or in a different way, like we 
create an experimental CVS branch off of every major release. That way, 
developers can more easily share experimental code, and if things settle 
there, they will be considered for adoption into HEAD.



Like Markus, I would like to see the various replication efforts merged as
best they can be because even if the majority of users don't use a little
bit of everything, surely the more interesting cases would like to and the
entire community is better served if the various solutions are in
harmony.


No doubt about that, and I was the one organizing the Afilias sponsored 
meeting in Toronto back then, where my reversed Postgres-R idea was 
taken apart because it won't scale due to the gigantic amount of 
synchronized group communication it would require. Again, it might be 
that experimental features will cause more of the efforts to converge by 
using the same base as a compromise, instead of having each and every 
support feature designed completely independently.


I still have a hard time understanding why someone would object to 
adding a feature, however useless it might seem to them, as long as it 
doesn't cost them anything. Admittedly, any feature causes maintenance 
costs on the side of the PostgreSQL development community (mainly those 
who actually contribute and maintain the code - fortunately that is a 
finite number - everyone please ask themselves if they are part of 
that). But aside from that, would anyone who is questioning the commit 
timestamp as I proposed it likewise vehemently object to yet another 
procedural language, or adding another log tuning switch? I don't think 
so. As long as it doesn't cost you unless you turn it on, why would you 
even care if it serves my purpose or not? The thing that kicked off this 
emotional spin was that multimaster replication is what so many people 
want, but nobody has a universal solution for. Everyone wants to see 
their problem solved as well, or "the solution isn't good". Tell you 
what, I can live with my problem solved even if it doesn't solve yours. 
Can you tell me what I have to modify in order to solve your problem as 
well, or are you asking 

Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Jan Wieck

On 2/7/2007 9:27 PM, Markus Schiltknecht wrote:

Hi,

Jan Wieck wrote:

Then let me give you a little puzzle just for the fun of it.

A database containing customer contact information (among other things) 
runs on a two-node multimaster system. One node is serving the customer 
web portal, the other is used by the company staff, including the call 
center. At 13:45 the two servers lose connectivity to each other, yet 
the internal staff can access the internal server while the web portal 
is accessible from the outside. At 13:50 customer A updates their credit 
card information through the web portal, while customer B does the same 
through the call center. At 13:55 both customers change their mind and 
want to use yet another credit card; now customer A phones the call 
center while customer B does it via the internet.


Phew, a mind twister... one customer would already be enough to trigger 
that sort of conflict...


At 14:00 the two servers reconnect and go through the conflict 
resolution. How do you intend to solve both conflicts without using any 
"clock", because that seems to be a stopword causing instant rejection 
of whatever you propose? Needless to say, both customers will be 
dissatisfied if you charge the wrong credit card during your next 
billing cycle.


Correct. But do these cases justify storing timestamps for each and every 
transaction you do? That's what I doubt, not the usefulness of time 
based conflict resolution for certain cases.


You can always add time based conflict resolution by adding a 
timestamp column and deciding upon that one. I'd guess that the overall 
costs are lower that way.


Yes, yes, and yes ... but aside from the problem that you use the very 
ambiguous word "timestamp" (which somehow suggests using a clock of 
some sort), isn't the begin timestamp of a long running transaction 
worse than the commit timestamp, when all its work became visible to the 
outside world instantaneously?
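For what it's worth, the commit-ordering property under discussion is the
Lamport rule: stamp each commit with at least one increment past the
previous stamp, falling back to the wall clock when it is ahead. A
minimal sketch under those assumptions (illustrative only, not the
proposed backend code):

```python
# Sketch of a Lamport-style commit timestamp: each commit is stamped
# with max(wall clock, previous stamp + smallest increment), so stamps
# strictly advance in commit order even if the clock steps backwards.
# The node id serves as a tie-breaker across nodes.

EPSILON = 1e-6  # smallest timestamp increment (illustrative)

class CommitStamper:
    def __init__(self, node_id, clock):
        self.node_id = node_id
        self.clock = clock           # callable returning current time
        self.last = float("-inf")

    def next_commit_timestamp(self):
        # Must be called serialized, in commit order, without races.
        self.last = max(self.clock(), self.last + EPSILON)
        return (self.last, self.node_id)

# A clock that momentarily steps backwards (say, an NTP adjustment):
ticks = iter([100.0, 100.5, 99.0, 101.0])
stamper = CommitStamper(node_id=1, clock=lambda: next(ticks))
stamps = [stamper.next_commit_timestamp() for _ in range(4)]
# stamps still strictly increase despite the 99.0 clock reading
```

Since the stamp is taken at commit, not at transaction begin, a
long-running transaction is ordered by the moment its work became
visible, which is the point being made above.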




But you've withdrawn that proposal already, so...

Which is a good discussion because one of the reasons why I stopped 
looking into Postgres-R is the fact that it is based on the idea to push 
all the replication information through a system that generates a global 
serialized message queue. That by itself isn't the problem, but the fact 
that implementing a global serialized message queue has serious 
throughput issues that are (among other details) linked to the speed of 
light.


Agreed. Nevertheless, there are use cases for such systems, because they 
place fewer limitations on the application. One could even argue that 
your above example would be one ;-)


Now we're in sync :-)



I am trying to start with a system that doesn't rely on such a 
mechanism for everything. I do intend to add an option later that 
allows declaring a UNIQUE NOT NULL constraint to be synchronous. What 
that means is that any INSERT, UPDATE, DELETE and SELECT FOR UPDATE 
will require the node to currently be a member of the (quorum or 
priority defined) majority of the cluster.


Sounds reasonable.

An advisory lock system, 
based on total order group communication, will grant the lock on the 
unique key values on a first come, first served basis. Every node in the 
cluster will keep those keys locked until the asynchronous 
replication stream reports the locking transaction as ended. If another 
remote transaction in the meantime requires updating such a key, the 
incoming stream from that node will be on hold until the lock is 
cleared. This is to protect against node B replicating a transaction from 
node A while a later update on node B arrives on C before C got the first 
event from A. A node that got disconnected from the cluster must rebuild 
the current advisory lock list upon reconnecting to the cluster.


Yeah, this is a convenient way to replicate sequences via a GCS.

I think that this will be a way to overcome Postgres-R's communication 
bottleneck, as well as allowing limited update activity even during a 
completely disconnected state of a node. Synchronous or group 
communication messages are reduced to the cases where the application 
cannot be implemented in a conflict-free way, like allocating a natural 
primary key. There is absolutely no need to synchronize, for example, 
creating a sales order. 


Agreed, such cases can easily be optimized. But you have to be aware of 
the limitations these optimizations cause. Postgres-R is much more 
targeted at very general use cases.


I am, if for no other reason than that I have been familiar with the 
concepts underneath Postgres-R for more than 3 years. What I realized is 
that the "general use case" (for arbitrarily complex applications) is 
very likely to be in conflict with any kind of good default performance 
case.




An application can use globally unique IDs for 
the order number. And everything possibly referenced by an order (items, 
customers, ...) is stored in a way that the references are never 
updated. Deletes of those possibly referenced objects are 

Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Bruce Momjian

I find the term "logical proof of its correctness" too restrictive.  It
sounds like some formal academic process that really doesn't work well
for us.

What I did want to hear is a layout of how the system would work, and an
exchange of ideas until almost everyone was happy.

Also, I saw the trigger patch with no explanation of why it was
important or who would use it --- that also isn't going to fly well.

So, to add something, the community needs to hear how it is going to
help users, because every code addition has cost, and we don't want to
add things unless they have general utility.  If someone can't explain
the utility of an addition, I question whether the person has fully
thought through where they are going.

As far as adding a language, no, we would not just add any language.  We
would judge whether the language has usefulness to our users.  I think
APL would be cool, but I am not sure it is usable, so there is a hurdle
even there.

As far as TOAST, there is no question in my mind that TOAST development
would happen the same way today as it did when we did it in 2001 --- we
have a problem, how can we fix it.



---

Jan Wieck wrote:
 On 2/7/2007 2:15 PM, Richard Troy wrote:
  Jan Wieck wrote:
   Are we still discussing if the Postgres backend may provide support for
   a commit timestamp, that follows the rules for Lamport timestamps in a
   multi-node cluster?
  
  ...I thought you said in this thread that you haven't and weren't going to
  work on any kind of logical proof of its correctness, saw no value in
  prototyping your way to a clear (convincing) argument, and were
  withdrawing the proposal [...]
 
 I said I don't have any such documents. I was asked to continue this 
 discussion in order to find people willing to help discover potential 
 problems. I am prepared to continue this development isolated, although 
 I wouldn't like to.
 
 The PostgreSQL developers community used to be good at throwing out 
 ideas, brainstorming about the possibilities, adding more to them and 
 coming up with very unique and flexible solutions. I am a little 
 disappointed that much of that got lost over the years and please 
 forgive me if I sound a little grumpy over that. The statement to 
 withdraw the proposal was certainly premature - consider it not 
 withdrawn at this time. However, comparing what used to be our process 
 to what I see today, I must say that something like TOAST would never 
 have happened. It was the result of a global brainstorming, that I 
 simply translated into C code. Many details and features of the 
 implementation are purely mine, but the really big sparks, that got it 
 to what it is, I'm not claiming for myself. Most importantly, "give me 
 proof of concept before we can talk about changing backend code" was not 
 part of the process at all. We were pretty eager to change things back 
 then, when we needed to get better in almost every way possible ... are 
 we so good at replication that we need to be conservative in that 
 respect now? We are certainly good at some things and have to be 
 conservative with respect to them, but replication in my not so very 
 humble opinion isn't one of them.
 
 I do understand that we have a codebase used in production these days. 
 And because of that we have to maintain code and syntax stability to a 
 degree, we didn't have back in the glory days of introducing EXCEPT and 
 INTERSECT (whose first incarnation was committed to the code base while 
 completely destroying my entire work of fixing the rewriter). Maybe we 
 need to introduce something entirely different, like the concept of an 
 experimental feature. Something that we add to the code but that is 
 explicitly flagged as not final, not stable, not guaranteed to stay or 
 work in this or any other form. This requires that the feature has very 
 limited interference with other parts of the system, like (or especially 
 like) the query parser. If it turns out to be a problem in x.y.0, it 
 will be backed out and gone in x.y.1. Or in a different way, like we 
 create an experimental CVS branch off of every major release. That way, 
 developers can more easily share experimental code and if things settle 
 there, they will be considered to be adopted into HEAD.
 
  Like Markus, I would like to see the various replication efforts merged as
  best they can be because even if the majority of users don't use a little
  bit of everything, surely the more interesting cases would like to and the
  entire community is better served if the various solutions are in
  harmony.
 
 No doubt about that and I was the one organizing the Afilias sponsored 
 meeting in Toronto back then, where my reversed Postgres-R idea was 
 taken apart because it won't scale due to the gigantic amount of 
 synchronized group communication it would require. Again, it might be 
 that experimental features will cause more of the efforts to converge 

Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Jan Wieck

On 2/7/2007 10:35 PM, Bruce Momjian wrote:

I find the term "logical proof of its correctness" too restrictive.  It
sounds like some formal academic process that really doesn't work well
for us.


Thank you.


Also, I saw the trigger patch with no explanation of why it was
important or who would use it --- that also isn't going to fly well.


You didn't respond to my explanation of how the current Slony 
implementation could improve and evolve using it. Are you missing 
something? I am discussing this very issue with our own QA department, 
and thus far, I think I have a majority of "would use a pg_trigger 
backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly 
how it corrupted my system catalog."



As far as TOAST, there is no question in my mind that TOAST development
would happen the same way today as it did when we did it in 2001 --- we
have a problem, how can we fix it.


Looking at what did happen back then and what happens in this case, I do 
see a difference. There were concerns about the compression algorithm 
used ... what is used today is still the first incarnation, and nobody 
ever bothered to even investigate whether there could possibly be any 
better thing. Do you think lzcompress is the best we can come up with? 
I don't! So why is it still the thing used? Maybe it is good enough?



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Proposal: Commit timestamp

2007-02-07 Thread Bruce Momjian
Jan Wieck wrote:
 On 2/7/2007 10:35 PM, Bruce Momjian wrote:
  I find the term logical proof of it's correctness too restrictive.  It
  sounds like some formal academic process that really doesn't work well
  for us.
 
 Thank you.
 
  Also, I saw the trigger patch with no explanation of why it was
  important or who would use it --- that also isn't going to fly well.
 
 You didn't respond to my explanation of how the current Slony 
 implementation could improve and evolve using it. Are you missing 
 something? I am discussing this very issue with our own QA department, 
 and thus far, I think I have a majority of "would use a pg_trigger 
 backpatched PostgreSQL" vs. "No, I prefer a system that knows exactly 
 how it corrupted my system catalog."

No, I _now_ understand the use case, but when the patch was posted, the
use case was missing.  I would like to see a repost with the patch, and
a description of its use so we can all move forward on that.

  As far as TOAST, there is no question in my mind that TOAST development
  would happen the same way today as it did when we did it in 2001 --- we
  have a problem, how can we fix it.
 
 Looking at what did happen back then and what happens in this case, I do 
 see a difference. There were concerns about the compression algorithm 
 used ... what is used today is still the first incarnation, and nobody 
 ever bothered to even investigate whether there could possibly be any 
 better thing. Do you think lzcompress is the best we can come up with? I don't! 
 So why is it still the thing used? Maybe it is good enough?

It is simple/stupid enough, I would say, and the compression space is a
mine-field of patents.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Proposal: Commit timestamp

2007-02-06 Thread Zeugswetter Andreas ADI SD

 What I'm more concerned about, with Jan's proposal, is the assumption 
 that you always want to resolve conflicts by time (except for 
 balances, 
 for which we don't have much information, yet). I'd rather 

Um, I think the proposal was only for beneficial backend functionality
for replication in general and time based conflict resolution. And time
based is surely one of the important conflict resolution methods for
async MM replication.

Sure there are others, like rule based or priority based, but I think
you don't need additional backend functionality for those.

Andreas



Re: [HACKERS] Proposal: Commit timestamp

2007-02-06 Thread Markus Schiltknecht

Hi,

Zeugswetter Andreas ADI SD wrote:

And time based
is surely one of the important conflict resolution methods for async MM
replication.


That's what I'm questioning. Wouldn't any other deterministic, but 
seemingly random, abort decision be as clever as time based conflict 
resolution? It would then be clear to the user that it's random, and not 
some "in most cases time based, but not in others, and only if..." thing.



Sure there are others, like rule based priority based but I think
you don't need additional backend functionality for those.


Got the point, yes. I'm impatient, sorry.

Nevertheless, I'm questioning if it is worth adding backend 
functionality for that. And given this probably is the most wanted 
resolution method, this question might be heretical. You could also 
see it as sort of a user educating question: don't favor time based 
resolution if that's the one resolution method with the most traps.


Regards

Markus



Re: [HACKERS] Proposal: Commit timestamp

2007-02-06 Thread Jan Wieck

On 2/6/2007 11:44 AM, Markus Schiltknecht wrote:

Hi,

Zeugswetter Andreas ADI SD wrote:

And time based
is surely one of the important conflict resolution methods for async MM
replication.


That's what I'm questioning. Wouldn't any other deterministic, but 
seemingly random, abort decision be as clever as time based conflict 
resolution? It would then be clear to the user that it's random, and not 
some "in most cases time based, but not in others, and only if..." thing.



Sure there are others, like rule based priority based but I think
you don't need additional backend functionality for those.


Got the point, yes. I'm impatient, sorry.

Nevertheless, I'm questioning if it is worth adding backend 
functionality for that. And given this probably is the most wanted 
resolution method, this question might be heretical. You could also 
see it as sort of a user educating question: don't favor time based 
resolution if that's the one resolution method with the most traps.


These are all very good suggestions towards additional conflict 
resolution mechanisms that solve one or the other problem. As we have 
said for years now, one size will not fit all. What I am after for the 
moment is a system that supports by default a last update wins on the 
row level, where "last update" certainly is a little fuzzy, but not by 
minutes. Plus balance type columns. A balance column is not propagated 
as a new value, but as a delta between the old and the new value. All 
replicas will apply the delta to that column regardless of whether the 
replication info is newer or older than the existing row. That way, 
literal value type columns (like an address) will maintain cluster wide 
the value of the last update to the row, while balance type columns will 
cluster wide maintain the sum of all changes.
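As a toy illustration of that split (hypothetical code, not the proposed
implementation): literal columns follow last-update-wins on the
replication timestamp, while balance columns apply the shipped delta
unconditionally, so replicas converge on the sum of all changes
regardless of arrival order:

```python
# Toy merge of one incoming replication event against a local row.
# 'literal' columns: keep the value of the newest update (last-update-wins).
# 'balance' columns: always apply the delta, regardless of event ordering.

def apply_event(row, event, balance_cols):
    """row/event: dicts with 'ts' (commit timestamp) and column values.
    Balance columns carry deltas in the event, not new values."""
    for col, val in event["cols"].items():
        if col in balance_cols:
            row["cols"][col] += val      # delta: order-independent
        elif event["ts"] > row["ts"]:
            row["cols"][col] = val       # literal: newest wins
    row["ts"] = max(row["ts"], event["ts"])
    return row

row = {"ts": 10, "cols": {"address": "Main St", "balance": 100}}
# Two events arriving out of order on this replica:
late  = {"ts": 30, "cols": {"address": "Oak Ave", "balance": -25}}
early = {"ts": 20, "cols": {"address": "Elm Rd",  "balance": +40}}
apply_event(row, late, {"balance"})
apply_event(row, early, {"balance"})
```

Note how the out-of-order early event no longer affects the address, but
its delta still lands in the balance.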


Whatever strategy one will use, in an async multimaster there are always 
cases that can be resolved by rules (last update being one of them), and 
some that I can't even imagine solving so far. I guess some of the cases 
will simply boil down to "the application has to make sure that ... 
never occurs". Think of a multi-item order, created on one node, while 
another node is deleting the long unused item (which would have to be 
backordered). Now while those two nodes figure out what to do to make 
this consistent again, a third node does a partial shipment of that 
order. The solution is simple, reinsert the deleted item ... only that 
there were rather nasty ON DELETE CASCADE's on that item that removed 
all the consumer reviews, product descriptions, data sheets and what 
not. It's going to be an awful lot of undo.


I haven't really made up my mind about a user defined rule based 
conflict resolution interface yet. I do plan to have a unique and 
foreign key constraint based, synchronous advisory locking system on top 
of my system in a later version (advisory key locks would stay in place 
until the transaction that placed them replicates).


I guess you see by now why I wanted to keep the discussion about the 
individual, rather generic support features in the backend separate from 
the particular features I plan to implement in the replication system. 
Everyone has different needs and consequently an async multi-master 
must do a whole range of mutually exclusive things altogether ... 
because Postgres can never accept a partial solution. We want the egg 
laying milk-wool-pig or nothing.



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-02-06 Thread Jim Nasby
Something worth noting... the only places I've actually seen MM 
replication implemented, each master was in fact still responsible 
for its own set of data. It was essentially something that you could 
really do with Slony, if you could tolerate the extreme complexity 
that would be involved. It might well be worth focusing on that case 
first, before trying to come up with a perfect last-committed mechanism.


On Feb 5, 2007, at 5:20 AM, Zeugswetter Andreas ADI SD wrote:


I think you are completely ignoring practicability. Or are you saying,
that such a system exists and works for e.g. a loosely connected group of
laptop field agents that only sporadically have a connection to the
cluster.

I think Jan's definition gives a pragmatic solution to the problem,
and will be able to give good automatic conflict resolution.

It has downsides he stated, and cannot guarantee 100% correct automatic
conflict resolution in case of connection loss, but I am quite sure you
are not able to do better, without losing yourself in theory.

e.g. assume all clocks vary by no more than 30 seconds when disconnected,
you can require manual (or rule based) resolution for all conflicts that
vary by less than 1 minute.


--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Proposal: Commit timestamp

2007-02-06 Thread Markus Schiltknecht

Hi,

Jan Wieck wrote:
Whatever strategy one will use, in an async multimaster there are always 
cases that can be resolved by rules (last update being one of them), and 
some that I can't even imagine solving so far. I guess some of the cases 
will simply boil down to "the application has to make sure that ... 
never occurs". Think of a multi-item order, created on one node, while 
another node is deleting the long unused item (which would have to be 
backordered). Now while those two nodes figure out what to do to make 
this consistent again, a third node does a partial shipment of that 
order.


It helps to categorize these conflict types. There basically are:

* data conflicts: simple row data, i.e. update - update conflicts.

* uniqueness conflicts: two rows conflict because they'd violate a 
uniqueness constraint, i.e. insert - insert, update - insert or update - 
update.


* visibility conflicts: basically the remaining update - delete and 
delete - delete cases. But also SELECT FOR UPDATE candidates, etc... 
Everything having to do with a rows not yet or no longer being visible 
to a transaction.
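The taxonomy can be pictured as a tiny classifier over the pair of
operations two nodes performed (purely illustrative, not Postgres-R
code; the `violates_unique` flag is a stand-in for an actual constraint
check):

```python
# Tiny classifier for the three conflict categories (illustrative only).

def classify(local_op, remote_op, violates_unique=False):
    """local_op / remote_op: 'insert' | 'update' | 'delete' issued on
    different nodes against the same logical row (or, for uniqueness,
    rows competing for the same key). Returns the conflict category."""
    ops = {local_op, remote_op}
    if "delete" in ops:
        return "visibility"   # update-delete, delete-delete, etc.
    if "insert" in ops or violates_unique:
        return "uniqueness"   # insert-insert, update-insert, update-update
    return "data"             # plain update-update on row data
```

An update-update pair is a data conflict on plain row values but becomes
a uniqueness conflict when both updates compete for the same key.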


Your example certainly involves a visibility conflict (update - delete). 
Not even (sync) Postgres-R can guarantee consistency on the visibility 
level, i.e. a first transaction's SELECT FOR UPDATE might not see a 
just recently committed transaction's newly inserted rows (because that 
one isn't replayed yet on the node, thus the transaction is working on 
an 'old' snapshot of the database state). Another, simpler example: 
Postgres-R doesn't raise a serialization error on delete-delete 
conflicts; it simply deletes the row once, even if two transactions 
that deleted the row were both confirmed as committed.


Luckily, most applications don't need that anyway, though.

The solution is simple, reinsert the deleted item ... 


...at which point timestamps certainly won't help :-)  Sorry, couldn't 
resist...


only that 
there were rather nasty ON DELETE CASCADE's on that item that removed 
all the consumer reviews, product descriptions, data sheets and what 
not. It's going to be an awful lot of undo.


Huh? Are you planning on aborting *parts* of a transaction? I didn't 
think about that, but my gut feeling is that you don't want to do that.


I haven't really made up my mind about a user defined rule based 
conflict resolution interface yet. I do plan to have a unique and 
foreign key constraint based, synchronous advisory locking system on top 
of my system in a later version (advisory key locks would stay in place 
until the transaction that placed them replicates).


You'd have to elaborate on that...

I guess you see by now why I wanted to keep the discussion about the 
individual, rather generic support features in the backend separate from 
the particular features I plan to implement in the replication system.


Sure. I know, discussions about replication can get endless, probably 
even are so by definition ;-)  But hey, they're fun!


Everyone has different needs and consequently an async multi-master 
must do a whole range of mutually exclusive things altogether ... 
because Postgres can never accept a partial solution. We want the egg 
laying milk-wool-pig or nothing.


Like the one which would result from a merge of such an async 
replication with a sync one? Imagine being able to choose between sync 
and async per transaction...


Regards

Markus




Re: [HACKERS] Proposal: Commit timestamp

2007-02-05 Thread Zeugswetter Andreas ADI SD

 One concept is to have a universal clock that ticks forward (like 
 every second) and each node orders all their transactions inside the 
 second-granular tick.  Then each commit would be like: {node, 
 clocksecond, txn#} and each time the clock ticks forward, txn# is 
 reset to zero.  This gives you ordered txns that are windowed in some 
 cluster-wide acceptable window (1 second).  However, this is totally 
 broken, as NTP is entirely insufficient for this purpose because of a 
 variety of forms of clock skew.  As such, the timestamp should be 
 incremented via cluster consensus (one token ring or the pulse 
 generated by the leader of the current cluster membership quorum).

I think you are completely ignoring practicality. Or are you saying
that such a system exists and works for, e.g., a loosely connected group of
laptop field agents that only sporadically have a connection to the
cluster?

I think Jan's definition gives a pragmatic solution to the problem,
and will be able to give good automatic conflict resolution.

It has the downsides he stated, and cannot guarantee 100% correct
automatic conflict resolution in case of connection loss, but I am
quite sure you are not able to do better without losing yourself in
theory.

E.g., assuming all clocks vary by no more than 30 seconds when
disconnected, you can require manual (or rule-based) resolution for
all conflicts whose timestamps differ by less than 1 minute.
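A minimal sketch of that threshold rule, purely illustrative (the constants and the `resolve` name are assumptions for demonstration, not from any existing implementation):

```python
# Illustrative sketch of the rule above: conflicts whose timestamps are
# closer together than the worst-case clock skew cannot be ordered
# reliably, so they are flagged for manual/rule-based resolution.

MAX_SKEW_SECONDS = 30   # assumed worst-case clock drift while disconnected
RESOLUTION_WINDOW = 60  # conflicts closer than this go to manual resolution

def resolve(local_ts, remote_ts):
    """Decide a conflicting update pair: 'local', 'remote', or 'manual'.

    Timestamps are seconds since epoch, each read on its own node's clock.
    """
    if abs(local_ts - remote_ts) < RESOLUTION_WINDOW:
        # The difference could be nothing but clock skew -- don't guess.
        return "manual"
    return "local" if local_ts > remote_ts else "remote"
```

Only timestamp gaps larger than twice the assumed skew are resolved automatically; everything closer is handed to a human or a rule.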

Andreas



Re: [HACKERS] Proposal: Commit timestamp

2007-02-05 Thread Andrew Sullivan
On Sun, Feb 04, 2007 at 01:36:03PM -0500, Jan Wieck wrote:
 For the fourth time: the clock is in the mix to allow nodes to continue 
 during a network outage. All your arguments seem to assume 100% network 
 uptime. There will be no cluster-wide clock or cluster-wide increment 
 when you lose connection. How does your idea cope with that?

I'm wondering whether a combined approach is needed.  This makes
things more complicated, but what if you somehow co-ordinate local
counters with shared clock ticks?  When you get a failure talking to
the shared clock, you regard yourself as being in some sort of failure
mode (you're going to need soft-fails and that sort of thing, and
yes, I'm flapping my hands in the air at the moment).  At rejoin to
the cluster, you need some way to publish "here's the counter
and the last global time I had, and here's my current counter".  You
can publish local time with this too, I guess, to solve for conflict
cases, but that seems like the sort of decision that needs to be
pushed down to policy level.
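The counter-plus-tick idea above could be sketched roughly like this; everything here (the class name, the tuple layout, the rejoin report) is a hypothetical illustration of the hand-waving, not a concrete design:

```python
# Hypothetical sketch of the hybrid idea: a per-node stamp made of the
# last seen shared clock tick plus a local commit counter, with a
# soft-fail mode while the shared clock is unreachable.

class HybridStamp:
    """Per-node commit stamp: (last seen global tick, local counter)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.global_tick = 0   # last tick received from the shared clock
        self.counter = 0       # local commits since that tick
        self.connected = True  # soft-fail flag

    def on_global_tick(self, tick):
        # A shared clock tick resets the local counter.
        self.global_tick = tick
        self.counter = 0
        self.connected = True

    def on_tick_timeout(self):
        # Couldn't reach the shared clock: enter soft-fail mode and
        # keep stamping commits with the local counter only.
        self.connected = False

    def stamp_commit(self):
        self.counter += 1
        return (self.global_tick, self.counter, self.node_id)

    def rejoin_report(self):
        # Published at rejoin: the last global tick seen, plus how far
        # the local counter advanced while disconnected.
        return {"node": self.node_id,
                "last_global_tick": self.global_tick,
                "local_counter": self.counter}
```

At rejoin, the cluster can see exactly how much work happened after the node's last shared tick, and apply whatever conflict policy it likes to that window.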

A

-- 
Andrew Sullivan  | [EMAIL PROTECTED]
The whole tendency of modern prose is away from concreteness.
--George Orwell



Re: [HACKERS] Proposal: Commit timestamp

2007-02-05 Thread Markus Schiltknecht

Hi,

Theo Schlossnagle wrote:

On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:
Obviously the counters will immediately drift apart based on the 
transaction load of the nodes as soon as the network goes down. And in 
order to avoid this clock confusion and wrong expectation, you'd 
rather have a system with such a simple, non-clock based counter and 
accept that it starts behaving totally wonky when the cluster 
reconnects after a network outage? I'd rather confuse a few people than 
have a "last update wins" conflict resolution that basically rolls 
dice to determine "last".


If your cluster partitions and you have hours of independent action, and 
upon merge you apply a conflict resolution algorithm that has an enormous 
effect, undoing portions of the last several hours of work on the nodes, 
you wouldn't call that wonky?


You are talking about different things. Async replication, as Jan is 
planning to do it, is per se wonky, because you have to cope with 
conflicts by definition. And you have to resolve them by late-aborting a 
transaction (i.e. after a commit). Or to put it another way: async MM 
replication means continuing in disconnected mode (w/o quorum or some 
such) and trying to reconcile later on. It should not matter if the 
delay is just some milliseconds of network latency or three days (except 
of course that you probably have more data to reconcile).


For sane disconnected (or more generally, partitioned) operation in 
multi-master environments, a quorum for the dataset must be 
established.  Now, one can consider the database to be the dataset.  
So, on network partitions those in the quorum are allowed to progress 
with data modification and others only read.


You can do this to *prevent* conflicts, but that clearly belongs to the 
world of sync replication. I'm doing this in Postgres-R: in case of 
network partitioning, only a primary partition may continue to process 
writing transactions. For async replication, it does not make sense to 
prevent conflicts when disconnected. Async is meant to cope with 
conflicts, so as to be independent of network latency.


However, there is no 
reason why the dataset _must_ be the database and that multiple datasets 
_must_ share the same quorum algorithm.  You could easily classify 
certain tables or schema or partitions into a specific dataset and apply 
a suitable quorum algorithm to that and a different quorum algorithm to 
other disjoint data sets.


I call that partitioning (among nodes). And it's applicable to sync as 
well as async replication, though it makes more sense in sync replication.


What I'm more concerned about, with Jan's proposal, is the assumption 
that you always want to resolve conflicts by time (except for balances, 
for which we don't have much information yet). I'd rather say that time 
does not matter much if your nodes are disconnected. And (especially in 
async replication) you should prevent your clients from committing to 
one node and then reading from another, expecting to find their data 
there. So why resolve by time? It only makes the user think you could 
guarantee that order, but you certainly cannot.


Regards

Markus


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Proposal: Commit timestamp

2007-02-04 Thread Peter Eisentraut
Jan Wieck wrote:
 This is all that is needed for last update wins resolution. And as
 said before, the only reason the clock is involved in this is so that
 nodes can continue autonomously when they lose connection without
 conflict resolution going crazy later on, which it would do if they
 were simple counters. It doesn't require microsecond synchronized
 clocks and the system clock isn't just used as a Lamport timestamp.

Earlier you said that one assumption is that all servers in the 
multimaster cluster are NTP-synchronized, which already rang alarm 
bells for me.  Now that I read this, you appear to require 
synchronization not on the microsecond level but on some level.  I 
think that would be pretty hard to manage for an administrator, seeing 
that NTP typically cannot provide such guarantees.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] Proposal: Commit timestamp

2007-02-04 Thread Jan Wieck

On 2/4/2007 3:16 AM, Peter Eisentraut wrote:

Jan Wieck wrote:

This is all that is needed for last update wins resolution. And as
said before, the only reason the clock is involved in this is so that
nodes can continue autonomously when they lose connection without
conflict resolution going crazy later on, which it would do if they
were simple counters. It doesn't require microsecond synchronized
clocks and the system clock isn't just used as a Lamport timestamp.


Earlier you said that one assumption is that all servers in the 
multimaster cluster are NTP-synchronized, which already rang alarm 
bells for me.  Now that I read this, you appear to require 
synchronization not on the microsecond level but on some level.  I 
think that would be pretty hard to manage for an administrator, seeing 
that NTP typically cannot provide such guarantees.


Synchronization to some degree is wanted to avoid totally unexpected 
behavior. The conflict resolution algorithm itself can live perfectly 
fine with counters, but I guess you wouldn't want the result of it. Say 
you update a record on one node, then 10 minutes later you update the 
same record on another node. Unfortunately, the nodes had no 
communication, and because the first node is much busier, its counter is 
way advanced ... this would mean the update made 10 minutes later would 
get lost in the conflict resolution when the nodes reestablish 
communication. They would have the same data at the end, just not what 
any sane person would expect.


This behavior will kick in whenever the cross-node conflicting updates 
happen close enough together that the time difference between the clocks 
can affect the outcome. So if you update the logically same row on two 
nodes within a tenth of a second, and the clocks are more than that 
apart, the conflict resolution can result in the older row surviving. 
Clock synchronization is simply used to minimize this window.
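A toy calculation of that window, with entirely made-up numbers, shows the genuinely older update winning:

```python
# Tiny numeric illustration of the failure window just described:
# node B's update really happens later, but B's clock runs behind,
# so last-update-wins keeps A's older row.

skew_ms = 150   # assumed: node B's clock runs 150 ms behind node A's
gap_ms = 100    # real-world gap: A updates first, B updates 100 ms later

a_stamp = 1_000_000                      # A's commit, on A's clock
b_stamp = 1_000_000 + gap_ms - skew_ms   # B's later commit, on B's slow clock

winner = "A" if a_stamp > b_stamp else "B"
# winner == "A": the older update survives because skew exceeds the gap
```

With the gap wider than the skew (say, 10 minutes apart on clocks off by 150 ms), the comparison goes the right way, which is all the clock synchronization is there to ensure.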


The system clock is used only to keep the counters somewhat synchronized 
in the case of connection loss, to retain some degree of "last update" 
meaning. Without that, continuing autonomously during a network outage 
is just not practical.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-04 Thread Theo Schlossnagle


On Feb 4, 2007, at 10:06 AM, Jan Wieck wrote:


On 2/4/2007 3:16 AM, Peter Eisentraut wrote:

Jan Wieck wrote:

This is all that is needed for last update wins resolution. And as
said before, the only reason the clock is involved in this is so  
that

nodes can continue autonomously when they lose connection without
conflict resolution going crazy later on, which it would do if they
were simple counters. It doesn't require microsecond synchronized
clocks and the system clock isn't just used as a Lamport timestamp.
Earlier you said that one assumption is that all servers in the  
multimaster cluster are ntp synchronized, which already rung the  
alarm bells in me.  Now that I read this you appear to require  
synchronization not on the microsecond level but on some level.  I  
think that would be pretty hard to manage for an administrator,  
seeing that NTP typically cannot provide such guarantees.


Synchronization to some degree is wanted to avoid totally  
unexpected behavior. The conflict resolution algorithm itself can  
live perfectly fine with counters, but I guess you wouldn't want  
the result of it. If you update a record on one node, then 10  
minutes later you update the same record on another node.  
Unfortunately, the nodes had no communication and because the first  
node is much busier, its counter is way advanced ... this would  
mean the 10 minutes later update would get lost in the conflict  
resolution when the nodes reestablish communication. They would  
have the same data at the end, just not what any sane person would  
expect.


This behavior will kick in whenever the cross node conflicting  
updates happen close enough so that the time difference between the  
clocks can affect it. So if you update the logical same row on two  
nodes within a tenth of a second, and the clocks are more than that  
apart, the conflict resolution can result in the older row  
surviving. Clock synchronization is simply used to minimize this.


The system clock is used only to keep the counters somewhat  
synchronized in the case of connection loss to retain some degree  
of last update meaning. Without that, continuing autonomously  
during a network outage is just not practical.


A Lamport clock addresses this.  It relies on a cluster-wide clock  
tick.  While it could be based on the system clock, it would not be  
based on more than one clock.  The point of the Lamport clock is that  
there is _a_ clock, not multiple ones.


One concept is to have a universal clock that ticks forward (like  
every second) and each node orders all their transactions inside the  
second-granular tick.  Then each commit would be like: {node,  
clocksecond, txn#} and each time the clock ticks forward, txn# is  
reset to zero.  This gives you ordered txns that are windowed in some  
cluster-wide acceptable window (1 second).  However, this is totally  
broken as NTP is entirely insufficient for this purpose because of a  
variety of forms of clock skew.  As such, the timestamp should be  
incremented via cluster consensus (a token ring, or the pulse  
generated by the leader of the current cluster membership quorum).
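The {node, clocksecond, txn#} scheme sketched in that paragraph might look like this; `TickStamper` and its tick-source callable are illustrative assumptions, not part of any real system:

```python
# Sketch of the {node, clocksecond, txn#} commit stamp described above.
# A real cluster would take the tick from cluster consensus (as the
# paragraph argues), not from each node's own clock.

import itertools

class TickStamper:
    def __init__(self, node_id, tick_source):
        self.node_id = node_id
        self.tick_source = tick_source       # callable returning current tick
        self.current_tick = tick_source()
        self.txn_seq = itertools.count()     # per-tick transaction sequence

    def stamp(self):
        tick = self.tick_source()
        if tick != self.current_tick:
            # The clock ticked forward: txn# resets to zero.
            self.current_tick = tick
            self.txn_seq = itertools.count()
        return (self.node_id, self.current_tick, next(self.txn_seq))
```

Cluster-wide ordering would then sort stamps by (clocksecond, txn#, node), giving the 1-second-windowed ordering described, subject to exactly the skew caveats raised.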


As the clock must be incremented cluster-wide, the need for it to be  
in sync with the system clock (on any or all of the systems) is  
obviated.  In fact, the fact that you can't guarantee synchronicity  
means that it can be confusing -- one expects a time-based clock to be  
accurate to the time.  A counter-based clock has no such expectations.


// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/





Re: [HACKERS] Proposal: Commit timestamp

2007-02-04 Thread Gregory Stark
Theo Schlossnagle [EMAIL PROTECTED] writes:

 As the clock must be incremented cluster-wide, the need for it to be in sync
 with the system clock (on any or all of the systems) is obviated.  In fact,
 the fact that you can't guarantee synchronicity means that it can be
 confusing -- one expects a time-based clock to be accurate to the time.  A
 counter-based clock has no such expectations.

So if the nodes get split they can keep operating independently but clients
can see that there's no guarantee of ordering against transactions from other
nodes because the clock isn't advancing?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Proposal: Commit timestamp

2007-02-04 Thread Jan Wieck

On 2/4/2007 10:53 AM, Theo Schlossnagle wrote:
As the clock must be incremented cluster-wide, the need for it to be  
in sync with the system clock (on any or all of the systems) is  
obviated.  In fact, the fact that you can't guarantee synchronicity  
means that it can be confusing -- one expects a time-based clock to be  
accurate to the time.  A counter-based clock has no such expectations.


For the fourth time: the clock is in the mix to allow nodes to continue 
during a network outage. All your arguments seem to assume 100% network 
uptime. There will be no cluster-wide clock or cluster-wide increment 
when you lose connection. How does your idea cope with that?


Obviously the counters will immediately drift apart based on the 
transaction load of the nodes as soon as the network goes down. And in 
order to avoid this clock confusion and wrong expectation, you'd 
rather have a system with such a simple, non-clock based counter and 
accept that it starts behaving totally wonky when the cluster reconnects 
after a network outage? I'd rather confuse a few people than have a 
"last update wins" conflict resolution that basically rolls dice to 
determine "last".



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-04 Thread Theo Schlossnagle


On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:


On 2/4/2007 10:53 AM, Theo Schlossnagle wrote:
As the clock must be incremented cluster-wide, the need for it to  
be in sync with the system clock (on any or all of the systems)  
is obviated.  In fact, the fact that you can't guarantee synchronicity  
means that it can be confusing -- one expects a time-based clock  
to be accurate to the time.  A counter-based clock has no such  
expectations.


For the fourth time: the clock is in the mix to allow nodes to continue  
during a network outage. All your arguments seem to assume 100%  
network uptime. There will be no cluster-wide clock or cluster-wide  
increment when you lose connection. How does your idea cope with that?


That's exactly what a quorum algorithm is for.

Obviously the counters will immediately drift apart based on the  
transaction load of the nodes as soon as the network goes down. And  
in order to avoid this clock confusion and wrong expectation,  
you'd rather have a system with such a simple, non-clock based  
counter and accept that it starts behaving totally wonky when the  
cluster reconnects after a network outage? I'd rather confuse a few  
people than have a "last update wins" conflict resolution that  
basically rolls dice to determine "last".


If your cluster partitions and you have hours of independent action,  
and upon merge you apply a conflict resolution algorithm that has an  
enormous effect, undoing portions of the last several hours of work on  
the nodes, you wouldn't call that wonky?


For sane disconnected (or more generally, partitioned) operation in  
multi-master environments, a quorum for the dataset must be  
established.  Now, one can consider the database to be the  
dataset.  So, on network partitions those in the quorum are allowed  
to progress with data modification and others only read.  However,  
there is no reason why the dataset _must_ be the database and that  
multiple datasets _must_ share the same quorum algorithm.  You could  
easily classify certain tables or schema or partitions into a  
specific dataset and apply a suitable quorum algorithm to that and a  
different quorum algorithm to other disjoint data sets.



// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/





Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Jan Wieck

On 2/1/2007 11:23 PM, Jim Nasby wrote:

On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
If a per database configurable tslog_priority is given, the  
timestamp will be truncated to milliseconds and the increment logic  
is done on milliseconds. The priority is added to the timestamp.  
This guarantees that no two timestamps for commits will ever be  
exactly identical, even across different servers.


Wouldn't it be better to just store that information separately,  
rather than mucking with the timestamp?


Though, there's another issue here... I don't think NTP is good for  
any better than a few milliseconds, even on a local network.


How exact does the conflict resolution need to be, anyway? Would it  
really be a problem if transaction B committed 0.1 seconds after  
transaction A yet the cluster thought it was the other way around?


Since the timestamp is basically a Lamport counter which is just bumped 
by the clock as well, it doesn't need to be too precise.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Theo Schlossnagle


On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:


On 2/1/2007 11:23 PM, Jim Nasby wrote:

On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
If a per database configurable tslog_priority is given, the   
timestamp will be truncated to milliseconds and the increment  
logic  is done on milliseconds. The priority is added to the  
timestamp.  This guarantees that no two timestamps for commits  
will ever be  exactly identical, even across different servers.
Wouldn't it be better to just store that information separately,   
rather than mucking with the timestamp?
Though, there's another issue here... I don't think NTP is good  
for  any better than a few milliseconds, even on a local network.
How exact does the conflict resolution need to be, anyway? Would  
it  really be a problem if transaction B committed 0.1 seconds  
after  transaction A yet the cluster thought it was the other way  
around?


Since the timestamp is basically a Lamport counter which is just  
bumped by the clock as well, it doesn't need to be too precise.


Unless I'm missing something, you are _treating_ the counter as a  
Lamport timestamp, when in fact it is not and thus does not provide  
the semantics of a Lamport timestamp.  As such, any algorithms that use  
Lamport timestamps as a basis or assumption for the proof of their  
correctness will not translate (provably) to this system.


How is your counter semantically equivalent to a Lamport timestamp?

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/





Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Jan Wieck

On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:

On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:


On 2/1/2007 11:23 PM, Jim Nasby wrote:

On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
If a per database configurable tslog_priority is given, the   
timestamp will be truncated to milliseconds and the increment  
logic  is done on milliseconds. The priority is added to the  
timestamp.  This guarantees that no two timestamps for commits  
will ever be  exactly identical, even across different servers.
Wouldn't it be better to just store that information separately,   
rather than mucking with the timestamp?
Though, there's another issue here... I don't think NTP is good  
for  any better than a few milliseconds, even on a local network.
How exact does the conflict resolution need to be, anyway? Would  
it  really be a problem if transaction B committed 0.1 seconds  
after  transaction A yet the cluster thought it was the other way  
around?


Since the timestamp is basically a Lamport counter which is just  
bumped by the clock as well, it doesn't need to be too precise.


Unless I'm missing something, you are _treating_ the counter as a  
Lamport timestamp, when in fact it is not and thus does not provide  
semantics of a Lamport timestamp.  As such, any algorithms that use  
Lamport timestamps as a basis or assumption for the proof of their  
correctness will not translate (provably) to this system.


How is your counter semantically equivalent to a Lamport timestamp?


Yes, you must be missing something.

The last used timestamp is remembered. When a remote transaction is 
replicated, the remembered timestamp is set to max(remembered, remote). 
For a local transaction, the remembered timestamp is set to 
max(remembered+1ms, systemclock) and that value is used as the 
transaction commit timestamp.
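The two update rules just described can be written down directly (timestamps as integer milliseconds; the function names are mine, for illustration only):

```python
# Direct transcription of the two rules above, using integer milliseconds.

def on_remote_commit(remembered_ms, remote_ms):
    # Replicating a remote transaction only advances the remembered
    # timestamp; the remote commit keeps the timestamp it arrived with.
    return max(remembered_ms, remote_ms)

def on_local_commit(remembered_ms, system_clock_ms):
    # A local commit is stamped strictly later than anything seen so far,
    # following the system clock whenever the clock is ahead.
    return max(remembered_ms + 1, system_clock_ms)
```

The value returned for a local commit becomes both the new remembered timestamp and the transaction's commit timestamp, which keeps a node's commit timestamps strictly increasing even if its system clock stalls or steps backwards.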



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Theo Schlossnagle


On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:


On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:

On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:

On 2/1/2007 11:23 PM, Jim Nasby wrote:

On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
If a per database configurable tslog_priority is given, the
timestamp will be truncated to milliseconds and the increment   
logic  is done on milliseconds. The priority is added to the   
timestamp.  This guarantees that no two timestamps for commits   
will ever be  exactly identical, even across different servers.
Wouldn't it be better to just store that information  
separately,   rather than mucking with the timestamp?
Though, there's another issue here... I don't think NTP is good   
for  any better than a few milliseconds, even on a local network.
How exact does the conflict resolution need to be, anyway?  
Would  it  really be a problem if transaction B committed 0.1  
seconds  after  transaction A yet the cluster thought it was the  
other way  around?


Since the timestamp is basically a Lamport counter which is just   
bumped by the clock as well, it doesn't need to be too precise.
Unless I'm missing something, you are _treating_ the counter as a   
Lamport timestamp, when in fact it is not and thus does not  
provide  semantics of a Lamport timestamp.  As such, any  
algorithms that use  Lamport timestamps as a basis or assumption  
for the proof of their  correctness will not translate (provably)  
to this system.

How is your counter semantically equivalent to a Lamport timestamp?


Yes, you must be missing something.

The last used timestamp is remembered. When a remote transaction is  
replicated, the remembered timestamp is set to max(remembered,  
remote). For a local transaction, the remembered timestamp is set  
to max(remembered+1ms, systemclock) and that value is used as the  
transaction commit timestamp.


A Lamport clock, IIRC, requires a cluster-wide tick.  This seems based  
only on activity and is thus an observational tick only, which means  
various nodes can have various perspectives at different times.


Given that time skew is prevalent, why is the system clock involved  
at all?


As is usual with distributed systems problems, they are very hard to  
explain casually and also hard to review from a theoretical angle  
without a proof.  Are you basing this off a paper?  If so, which one?   
If not, have you written a rigorous proof of correctness for this  
approach?


// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Jan Wieck

On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:

On Feb 3, 2007, at 4:38 PM, Jan Wieck wrote:


On 2/3/2007 4:05 PM, Theo Schlossnagle wrote:

On Feb 3, 2007, at 3:52 PM, Jan Wieck wrote:

On 2/1/2007 11:23 PM, Jim Nasby wrote:

On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
If a per database configurable tslog_priority is given, the
timestamp will be truncated to milliseconds and the increment   
logic  is done on milliseconds. The priority is added to the   
timestamp.  This guarantees that no two timestamps for commits   
will ever be  exactly identical, even across different servers.
Wouldn't it be better to just store that information  
separately,   rather than mucking with the timestamp?
Though, there's anothe issue here... I don't think NTP is good   
for  any better than a few milliseconds, even on a local network.
How exact does the conflict resolution need to be, anyway?  
Would  it  really be a problem if transaction B committed 0.1  
seconds  after  transaction A yet the cluster thought it was the  
other way  around?


Since the timestamp is basically a Lamport counter which is just   
bumped be the clock as well, it doesn't need to be too precise.
Unless I'm missing something, you are _treating_ the counter as a   
Lamport timestamp, when in fact it is not and thus does not  
provide  semantics of a Lamport timestamp.  As such, any  
algorithms that use  lamport timestamps as a basis or assumption  
for the proof of their  correctness will not translate (provably)  
to this system.

How are your counter semantically equivalent to Lamport timestamps?


Yes, you must be missing something.

The last used timestamp is remembered. When a remote transaction is  
replicated, the remembered timestamp is set to max(remembered,  
remote). For a local transaction, the remembered timestamp is set  
to max(remembered+1ms, systemclock) and that value is used as the  
transaction commit timestamp.


A Lamport clock, IIRC, requires a cluster-wide tick.  This seems based  
only on activity and is thus an observational tick only, which means  
various nodes can have various perspectives at different times.


Given that time skew is prevalent, why is the system clock involved  
at all?


This question was already answered.

As is usual with distributed systems problems, they are very hard to  
explain casually and also hard to review from a theoretical angle  
without a proof.  Are you basing this off a paper?  If so which one?   
If not, have you written a rigorous proof of correctness for this  
approach?


I don't have any such paper, and the proof of concept will be the 
implementation of the system. I do however see enough resistance against 
this proposal to withdraw the commit timestamp at this time. The new 
replication system will therefore require the installation of a patched, 
non-standard PostgreSQL version, compiled from source cluster-wide, in 
order to be used. I am aware that this will dramatically reduce its 
popularity, but it is impossible to develop this essential feature as an 
external module.


I thank everyone for their attention.


Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Theo Schlossnagle


On Feb 3, 2007, at 5:09 PM, Jan Wieck wrote:


On 2/3/2007 4:58 PM, Theo Schlossnagle wrote:

I don't have any such paper and the proof of concept will be the  
implementation of the system. I do however see enough resistance  
against this proposal to withdraw the commit timestamp at this  
time. The new replication system will therefore require the  
installation of a patched, non-standard PostgreSQL version,  
compiled from sources cluster wide in order to be used. I am aware  
that this will dramatically reduce its popularity but it is  
impossible to develop this essential feature as an external module.


I thank everyone for their attention.


Actually, I believe the commit timestamp stuff would be very useful  
in general.  I would certainly like to see rigorous proofs of any   
multi-master replication technology built on top of them.  I believe  
that while your replication stuff might rely on the commit  
timestamps, the commit timestamps rely on nothing else (except the work  
that you have been churning on).


Using commit timestamps, one can easily implement cross-vendor  
database replication.  These can be used to implement something like  
trigger-selective redo logs.  I think they can be used to produce DML  
logs that will require a lot less accounting to manage when replicating  
tables from PostgreSQL into another database (like Oracle or MySQL).


// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/





Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Bruce Momjian
Jan Wieck wrote:
 I don't have any such paper and the proof of concept will be the 
 implementation of the system. I do however see enough resistance against 
 this proposal to withdraw the commit timestamp at this time. The new 
 replication system will therefore require the installation of a patched, 
 non-standard PostgreSQL version, compiled from sources cluster wide in 
 order to be used. I am aware that this will dramatically reduce its 
 popularity but it is impossible to develop this essential feature as an 
 external module.
 
 I thank everyone for their attention.

Going off and working on it on your own doesn't seem like the proper
solution.  I don't see people objecting to adding it, but they want it to
work, which I am sure you want too.  You have to show how it will work
and convince others of that, and then you have a higher chance it will
work and be in the PostgreSQL codebase.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Proposal: Commit timestamp

2007-02-03 Thread Jan Wieck

On 2/3/2007 5:20 PM, Bruce Momjian wrote:

Jan Wieck wrote:
I don't have any such paper and the proof of concept will be the 
implementation of the system. I do however see enough resistance against 
this proposal to withdraw the commit timestamp at this time. The new 
replication system will therefore require the installation of a patched, 
non-standard PostgreSQL version, compiled from sources cluster-wide in 
order to be used. I am aware that this will dramatically reduce its 
popularity, but it is impossible to develop this essential feature as an 
external module.


I thank everyone for their attention.


Going off and working on it on your own doesn't seem like the proper
solution.  I don't see people objecting to adding it, but they want it to
work, which I am sure you want too.  You have to show how it will work
and convince others of that, and then you have a higher chance that it will
work and be in the PostgreSQL codebase.


Bruce,

I think I have explained in sufficient detail how this Lamport 
timestamp will be unique and ever increasing, with the node's ID being 
used as a tie breaker. The only thing important for last update wins 
conflict resolution is that whatever timestamp you have associated with 
a row, the update you do to it must be associated with a later timestamp 
so that all other nodes will overwrite the data. If a third node gets 
the two updates out of order, it will apply the second node's update, and 
since the row it then has carries a later timestamp than the first update 
arriving late, it will throw that information away. All nodes are in sync 
again.
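Jan's convergence argument can be sketched in a few lines (Python, purely illustrative; the row and update dictionaries are invented for the example, not anything from the proposal):

```python
# Last-update-wins: apply an incoming update only if its (timestamp,
# node_id) pair sorts after what the row already carries; node_id is
# the tie breaker.  Late-arriving older updates are simply discarded.

def apply_update(row, update):
    if (update["ts"], update["node_id"]) > (row["ts"], row["node_id"]):
        row.update(value=update["value"], ts=update["ts"],
                   node_id=update["node_id"])
    # else: the incoming update is older -- throw that information away

# A third node receiving two updates to the same row out of order:
row = {"value": "a", "ts": 1.0, "node_id": 1}
second = {"value": "c", "ts": 3.0, "node_id": 2}   # later update
first = {"value": "b", "ts": 2.0, "node_id": 1}    # earlier update

apply_update(row, second)   # applied
apply_update(row, first)    # arrives late, discarded
assert row["value"] == "c"  # same end state as nodes that saw them in order
```

Every node ends up keeping the datum with the latest timestamp, which is exactly the "all nodes in sync again" outcome described above.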


This is all that is needed for last update wins resolution. And as said 
before, the only reason the clock is involved in this is so that nodes 
can continue autonomously when they lose connection, without conflict 
resolution going crazy later on, which it would do if they were simple 
counters. It doesn't require microsecond-synchronized clocks; the 
system clock is essentially just used to seed a Lamport timestamp.


The problem seems to me that people want a full scale proof of concept 
for the whole multimaster replication system I'm planning, instead of 
thinking in isolation about this one aspect, the intended use case and other 
possible uses for it (like table logging). And we all know that that 
discussion will take us way beyond the 8.3 feature freeze date, so the 
whole thing will never get done.


I don't want to work on this on my own and I sure would prefer it to be 
a default PostgreSQL feature. As said, I have learned some things from 
Slony-I. One of them is that I will not go through any more ugly 
workarounds in order to not require a patched backend. If the features I 
really need aren't going to be in the default codebase, people will have 
to install from patched sources.


Finally, again, Slony-I could well have used this feature. With a 
logical commit timestamp, I would never even have thought about that 
other wart called xxid. It would all have been so much easier.



Jan

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



Re: [HACKERS] Proposal: Commit timestamp

2007-02-02 Thread Jim Nasby

On Jan 25, 2007, at 6:16 PM, Jan Wieck wrote:
If a per database configurable tslog_priority is given, the  
timestamp will be truncated to milliseconds and the increment logic  
is done on milliseconds. The priority is added to the timestamp.  
This guarantees that no two timestamps for commits will ever be  
exactly identical, even across different servers.


Wouldn't it be better to just store that information separately,  
rather than mucking with the timestamp?


Though, there's another issue here... I don't think NTP is good for  
anything better than a few milliseconds, even on a local network.


How exact does the conflict resolution need to be, anyway? Would it  
really be a problem if transaction B committed 0.1 seconds after  
transaction A yet the cluster thought it was the other way around?

--
Jim Nasby    [EMAIL PROTECTED]
EnterpriseDB    http://enterprisedb.com    512.569.9461 (cell)





Re: [HACKERS] Proposal: Commit timestamp

2007-01-27 Thread Gregory Stark

Jan Wieck [EMAIL PROTECTED] writes:

 I think the system I described is a slightly modified Lamport generator. The
 maximum timestamp of any row updated in this transaction you can consider
 as the counters received from other nodes. Then I make sure that the next
 counter (timestamp) is higher than anything I know so far, and I add a
 cluster-wide unique tie breaker to that.

If you know all the timestamps in the system then you don't need timestamps at
all, just use a counter that you increment by one each time.

Isn't the whole reason people use timestamps is so that you don't have to
depend on atomically knowing every timestamp in the system? So two
transactions can commit simultaneously on different systems and use the
timestamps to resolve conflicts later.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Proposal: Commit timestamp

2007-01-27 Thread Jan Wieck

On 1/27/2007 7:26 AM, Gregory Stark wrote:

Jan Wieck [EMAIL PROTECTED] writes:


I think the system I described is a slightly modified Lamport generator. The
maximum timestamp of any row updated in this transaction you can consider as
the counters received from other nodes. Then I make sure that the next
counter (timestamp) is higher than anything I know so far, and I add a
cluster-wide unique tie breaker to that.


If you know all the timestamps in the system then you don't need timestamps at
all, just use a counter that you increment by one each time.

Isn't the whole reason people use timestamps is so that you don't have to
depend on atomically knowing every timestamp in the system? So two
transactions can commit simultaneously on different systems and use the
timestamps to resolve conflicts later.


This assumes that you never lose contact with the cluster or, if you do, 
instantly stop all update activity, because you run the risk that the 
counters diverge. This risk is much higher with a simple counter than 
with a system clock that was in sync at the time of disconnect.


With all the disadvantages and the pain factor of an asynchronous 
multimaster replication system comes one big advantage. You can continue 
autonomously and let conflict resolution figure it out later.



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Naz Gassiep
I would be *very* concerned that system time is not a guaranteed 
monotonic entity. Surely a counter or other internally managed mechanism 
would be a better solution.


Furthermore, what would be the ramifications of master and slave system 
times being out of sync?


Finally, what if system time is rolled forward a few minutes as part of a 
correction and there were transactions completed in that time? There is 
a chance, albeit small, that two transactions will have the same 
timestamp. More importantly, this will throw up all kinds of issues when 
the slave sees transactions in the future. Even with regular NTP syncs, 
drift correction can cause a clock to be rolled forward a few milliseconds, 
possibly resulting in duplicate transaction IDs.


In summary, I don't think the use of system time has any place in 
PostgreSQL's internal consistency mechanisms, it is too unreliable an 
environment property. Why can't a counter be used for this instead?


- Naz.

Jan Wieck wrote:
For a future multimaster replication system, I will need a couple of 
features in the PostgreSQL server itself. I will submit separate 
proposals per feature so that discussions can be kept focused on one 
feature per thread.


For conflict resolution purposes in an asynchronous multimaster 
system, the last update definition often comes into play. For this 
to work, the system must provide a monotonically increasing timestamp 
taken at the commit of a transaction. During replication, the 
replication process must be able to provide the remote node's timestamp 
so that the replicated data will be as of the time it was written on 
the remote node, and not the current local time of the replica, which 
is, by the definition of asynchronous, later.


To provide this data, I would like to add another log directory, 
pg_tslog. The files in this directory will be similar to the clog, but 
contain arrays of timestamptz values. On commit, the current system 
time will be taken. As long as this time is lower or equal to the last 
taken time in this PostgreSQL instance, the value will be increased by 
one microsecond. The resulting time will be added to the commit WAL 
record and written into the pg_tslog file.


If a per database configurable tslog_priority is given, the timestamp 
will be truncated to milliseconds and the increment logic is done on 
milliseconds. The priority is added to the timestamp. This guarantees 
that no two timestamps for commits will ever be exactly identical, 
even across different servers.


The COMMIT syntax will get extended to

COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];

The extension is limited to superusers and will override the normally 
generated commit timestamp. This will be used to give the replicating 
transaction on the replica the exact same timestamp it got on the 
originating master node.


The pg_tslog segments will be purged like the clog segments, after all 
transactions belonging to them have been stamped frozen. A frozen xid 
by definition has a timestamp of epoch. To ensure a system using this 
timestamp feature has enough time to perform its work, a new GUC 
variable defining an interval will prevent vacuum from freezing xid's 
that are younger than that.


A function get_commit_timestamp(xid) returning timestamptz will return 
the commit time of a transaction as recorded by this feature.



Comments, changes, additions?

Jan





Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Markus Schiltknecht

Hi,

Jan Wieck wrote:
The replication system I have in mind will have another field type of 
the balance nature, where it will never communicate the current value 
but only deltas that get applied regardless of the two timestamps.


I'd favor a more generally usable conflict resolution function 
interface, on top of which you can implement both the last update 
wins and the balance conflict resolution types.


Passing the last common ancestor and the two conflicting heads to the 
conflict resolution function (CRF) should be enough. That would easily 
allow implementing the balance type (as you can calculate both 
deltas). And if you want to rely on something as arbitrary as a 
timestamp, you'd simply have to add a timestamp column to your table and 
let the CRF decide upon that.
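A sketch of what such a CRF interface could look like (illustrative only; the function names and the plain-number balance representation are assumptions, not anything specified in this thread):

```python
# Generic conflict resolution: the CRF receives the last common ancestor
# and the two conflicting heads, and returns the merged result.

def resolve_balance(ancestor, head_a, head_b):
    # "balance" type: compute both deltas against the common ancestor
    # and apply them both; no timestamps involved
    return ancestor + (head_a - ancestor) + (head_b - ancestor)

def resolve_last_update_wins(head_a, head_b):
    # heads carry (timestamp, value); the later timestamp wins
    return head_a if head_a[0] >= head_b[0] else head_b

# ancestor balance 100; one head withdrew 30, the other deposited 50
assert resolve_balance(100, 70, 150) == 120

# overwrite-style column resolved by a user-supplied timestamp column
assert resolve_last_update_wins((5, "x"), (7, "y")) == (7, "y")
```

A priority-based CRF would follow the same pattern, comparing a priority column before (or instead of) the timestamp.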


This would allow pretty much any type of conflict resolution, for 
example: higher priority cleanup transactions, which change lots of 
tuples and should better not be aborted later on. Those could be 
implemented by adding a priority column and having the CRF respect that 
one, too.


To find the last common ancestor tuple, transaction ids and MVCC are 
enough. You wouldn't need to add timestamps. You'd only have to make 
sure VACUUM doesn't delete tuples you still need.


Regards

Markus



Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Heikki Linnakangas

Jan Wieck wrote:
But it is a datum that needs to be collected at the moment where 
basically the clog entry is made ... I don't think any external module 
can do that ever.


How atomic does it need to be? External modules can register callbacks 
that get called right after the clog update and the removal of the xid 
from the MyProc entry. That's about as close to making the clog entry as 
you can get.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Theo Schlossnagle

Jan, et. al.,

On Jan 26, 2007, at 2:37 AM, Naz Gassiep wrote:
I would be *very* concerned that system time is not a guaranteed  
monotonic entity. Surely a counter or other internally managed  
mechanism would be a better solution.


As you should be.  Looking on my desk through the last few  
issues of IEEE Transactions on Parallel and Distributed Systems, I  
see no time-synch work for clusters of machines that is actually  
based on wall-clock time.  Almost all rely on something like a Lamport  
timestamp or some relaxation thereof.  A few are based on a tree-based  
pulse.  Using actual times is fraught with problems and is typically  
inappropriate for cluster synchronization needs.


Furthermore, what would be the ramifications of master and slave  
system times being out of sync?


I'm much more concerned with the overall approach.  The algorithm for  
replication should be published in theoretic style with a thorough  
analysis of its assumptions and a proof of correctness based on those  
assumptions.  Databases and replication are definitely not  
off-the-cuff technologies; they require rigorous academic  
discussion and acceptance before they will get adopted.  People  
generally will not adopt technologies to store mission-critical data  
until they are confident that they will both work as designed and work  
as implemented -- the second is far less important, as the weaknesses  
there are simply bugs.


I'm not implying that this rigorous dissection of replication design  
hasn't happened, but I didn't see it referenced anywhere in this  
thread.  Can you point me to it?  I've reviewed many of these papers  
and would like to better understand what you are aiming at.


Best regards,

Theo Schlossnagle





Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Jan Wieck

On 1/26/2007 2:37 AM, Naz Gassiep wrote:
I would be *very* concerned that system time is not a guaranteed 
monotonic entity. Surely a counter or other internally managed mechanism 
would be a better solution.


Such a counter has only local relevance. How do you plan to compare 
the two separate counters on different machines to tell which 
transaction happened last?


Even if the system clock isn't monotonically increasing, the described 
increment system guarantees that the timestamp used appears to be. 
Granted, this system will not work too well on a platform that doesn't 
allow slewing the system clock.




Furthermore, what would be the ramifications of master and slave system 
times being out of sync?


The origin of a transaction must scan all tuples it updates and make 
sure that the timestamp it uses for commit appears in the future with 
respect to them.




Finally what if system time is rolled forward a few minutes as part of a 
correction and there were transactions completed in that time? There is 
a change, albeit small, that two transactions will have the same 
timestamp. More importantly, this will throw all kinds of issues in when 
the slave sees transactions in the future. Even with regular NTP syncs, 
drift can cause a clock to be rolled forward a few milliseconds, 
possibly resulting in duplicate transaction IDs.


In summary, I don't think the use of system time has any place in 
PostgreSQL's internal consistency mechanisms, it is too unreliable an 
environment property. Why can't a counter be used for this instead?


This is not used for PostgreSQL's own consistency. It is a vehicle 
intended to be used to synchronize the last update wins decision 
process of an asynchronous multimaster system. If not with a timestamp, 
how would you make sure that the replication processes of two different 
nodes will come to the same conclusion as to which update was last? 
Especially considering that the replication might take place hours after 
the original transaction happened.



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Simon Riggs
On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:

 To provide this data, I would like to add another log directory, 
 pg_tslog. The files in this directory will be similar to the clog, but 
 contain arrays of timestamptz values. On commit, the current system time 
 will be taken. As long as this time is lower or equal to the last taken 
 time in this PostgreSQL instance, the value will be increased by one 
 microsecond. The resulting time will be added to the commit WAL record 
 and written into the pg_tslog file.

A transaction time table/log has other uses as well, so it's fairly
interesting to have this.

  COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];
 
 The extension is limited to superusers and will override the normally 
 generated commit timestamp. 

I don't think it's acceptable to override the normal timestamp. That
could lead to non-monotonic time values, which could screw up PITR. My
view is that you still need PITR even when you are using replication,
because the former provides recoverability and the latter provides
availability.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Jan Wieck

On 1/26/2007 8:26 AM, Simon Riggs wrote:

On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:

To provide this data, I would like to add another log directory, 
pg_tslog. The files in this directory will be similar to the clog, but 
contain arrays of timestamptz values. On commit, the current system time 
will be taken. As long as this time is lower or equal to the last taken 
time in this PostgreSQL instance, the value will be increased by one 
microsecond. The resulting time will be added to the commit WAL record 
and written into the pg_tslog file.


A transaction time table/log has other uses as well, so it's fairly
interesting to have this.


 COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];

The extension is limited to superusers and will override the normally 
generated commit timestamp. 


I don't think it's acceptable to override the normal timestamp. That
could lead to non-monotonic time values, which could screw up PITR. My
view is that you still need PITR even when you are using replication,
because the former provides recoverability and the latter provides
availability.


Without that it is rendered useless for conflict resolution purposes.

The timestamp used does not necessarily have much to do with the real 
time at commit, although I'd like it to be as close as possible. This 
timestamp marks the age of the new datum in an update. Since the 
replication is asynchronous, the update on the remote systems will 
happen later, but the timestamp recorded with that datum must be the 
timestamp of the original transaction, not the current time when it is 
replicated remotely. All we have to determine that is the xmin in the 
row's tuple header, so that xmin must resolve to the original 
transaction's timestamp.



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Stephen Frost
* Jan Wieck ([EMAIL PROTECTED]) wrote:
 On 1/26/2007 2:37 AM, Naz Gassiep wrote:
 I would be *very* concerned that system time is not a guaranteed 
 monotonic entity. Surely a counter or other internally managed mechanism 
 would be a better solution.
 
 Such a counter has only local relevance. How do you plan to compare 
 the two separate counters on different machines to tell which 
 transaction happened last?

I'd also suggest you look into Lamport timestamps...  Trusting the
system clock just isn't practical, even with NTP.  I've developed
(albeit relatively small) systems using Lamport timestamps and would be
happy to talk about it offlist.  I've probably got some code I could
share as well.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Andrew Dunstan

Stephen Frost wrote:

I'd also suggest you look into Lamport timestamps...  Trusting the
system clock just isn't practical, even with NTP.  I've developed
(albeit relatively small) systems using Lamport timestamps and would be
happy to talk about it offlist.  I've probably got some code I could
share as well.
  


that looks like what Oracle RAC uses: 
http://www.lc.leidenuniv.nl/awcourse/oracle/rac.920/a96597/coord.htm


cheers

andrew





Re: [HACKERS] Proposal: Commit timestamp

2007-01-26 Thread Jan Wieck

On 1/26/2007 9:38 AM, Stephen Frost wrote:

* Jan Wieck ([EMAIL PROTECTED]) wrote:

On 1/26/2007 2:37 AM, Naz Gassiep wrote:
I would be *very* concerned that system time is not a guaranteed 
monotonic entity. Surely a counter or other internally managed mechanism 
would be a better solution.


Such a counter has only local relevance. How do you plan to compare 
the two separate counters on different machines to tell which 
transaction happened last?


I'd also suggest you look into Lamport timestamps...  Trusting the
system clock just isn't practical, even with NTP.  I've developed
(albeit relatively small) systems using Lamport timestamps and would be
happy to talk about it offlist.  I've probably got some code I could
share as well.


I think the system I described is a slightly modified Lamport generator. 
The maximum timestamp of any row updated in this transaction you can 
consider as the counters received from other nodes. Then I make sure 
that the next counter (timestamp) is higher than anything I know so far, 
and I add a cluster-wide unique tie breaker to that.


Looking closer, I don't even have to check the timestamps of the rows 
updated. Since a replicated remote transaction will bump the local 
Lamport clock on commit, a local transaction modifying such a row will 
have a timestamp in the future of that remote transaction, even if my 
local clock is limping behind.
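A minimal sketch of this modified Lamport generator (Python, illustrative; the real mechanism would operate on timestamptz values inside the backend at commit time):

```python
import time

class LamportClock:
    """System-clock-seeded Lamport timestamp with a cluster-wide unique
    node id as tie breaker (a sketch of the scheme described above)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.last = 0.0

    def now(self):
        # Timestamp for a local commit: track the wall clock, but never
        # emit a value lower than or equal to the previous one.
        t = time.time()
        if t <= self.last:
            t = self.last + 1e-6   # bump by one microsecond
        self.last = t
        return (t, self.node_id)

    def observe(self, remote_ts):
        # Replicating a remote commit bumps the local clock, so later
        # local updates sort after it even if the wall clock lags.
        self.last = max(self.last, remote_ts)

clock = LamportClock(node_id=1)
a = clock.now()
clock.observe(a[0] + 60.0)   # remote commit from a clock 60s ahead
b = clock.now()
assert b[0] > a[0] + 60.0    # local commit still sorts after it
```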



Jan




[HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Jan Wieck
For a future multimaster replication system, I will need a couple of 
features in the PostgreSQL server itself. I will submit separate 
proposals per feature so that discussions can be kept focused on one 
feature per thread.


For conflict resolution purposes in an asynchronous multimaster system, 
the last update definition often comes into play. For this to work, 
the system must provide a monotonically increasing timestamp taken at 
the commit of a transaction. During replication, the replication process 
must be able to provide the remote nodes timestamp so that the 
replicated data will be as of the time it was written on the remote 
node, and not the current local time of the replica, which is by 
definition of asynchronous later.


To provide this data, I would like to add another log directory, 
pg_tslog. The files in this directory will be similar to the clog, but 
contain arrays of timestamptz values. On commit, the current system time 
will be taken. As long as this time is lower than or equal to the last taken 
time in this PostgreSQL instance, the value will be increased by one 
microsecond. The resulting time will be added to the commit WAL record 
and written into the pg_tslog file.


If a per database configurable tslog_priority is given, the timestamp 
will be truncated to milliseconds and the increment logic is done on 
milliseconds. The priority is added to the timestamp. This guarantees 
that no two timestamps for commits will ever be exactly identical, even 
across different servers.


The COMMIT syntax will get extended to

COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];

The extension is limited to superusers and will override the normally 
generated commit timestamp. This will be used to give the replicating 
transaction on the replica the exact same timestamp it got on the 
originating master node.


The pg_tslog segments will be purged like the clog segments, after all 
transactions belonging to them have been stamped frozen. A frozen xid by 
definition has a timestamp of epoch. To ensure a system using this 
timestamp feature has enough time to perform its work, a new GUC 
variable defining an interval will prevent vacuum from freezing xid's 
that are younger than that.


A function get_commit_timestamp(xid) returning timestamptz will return 
the commit time of a transaction as recorded by this feature.



Comments, changes, additions?

Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Neil Conway
On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
 For conflict resolution purposes in an asynchronous multimaster system, 
 the last update definition often comes into play. For this to work, 
 the system must provide a monotonically increasing timestamp taken at 
 the commit of a transaction.

Do you really need an actual timestamptz derived from the system clock,
or would a monotonically increasing 64-bit counter be sufficient? (The
assumption that the system clock is monotonically increasing seems
pretty fragile, in the presence of manual system clock changes, ntpd,
etc.)

 Comments, changes, additions?

Would this feature have any use beyond the specific project/algorithm
you have in mind?

-Neil





Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Tom Lane
Jan Wieck [EMAIL PROTECTED] writes:
 To provide this data, I would like to add another log directory, 
 pg_tslog. The files in this directory will be similar to the clog, but 
 contain arrays of timestamptz values.

Why should everybody be made to pay this overhead?

 The COMMIT syntax will get extended to
 COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];
 The extension is limited to superusers and will override the normally 
 generated commit timestamp. This will be used to give the replicating 
 transaction on the replica the exact same timestamp it got on the 
 originating master node.

I'm not convinced you've even thought this through.  If you do that then
you have no guarantee of commit timestamp monotonicity on the slave
(if it has either multi masters or any locally generated transactions).
Since this is supposedly for a multi-master system, that seems a rather
fatal objection --- no node in the system will actually have commit
timestamp monotonicity.  What are you hoping to accomplish with this?

regards, tom lane



Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Jan Wieck

On 1/25/2007 6:47 PM, Neil Conway wrote:

On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
For conflict resolution purposes in an asynchronous multimaster system, 
the last update definition often comes into play. For this to work, 
the system must provide a monotonically increasing timestamp taken at 
the commit of a transaction.


Do you really need an actual timestamptz derived from the system clock,
or would a monotonically increasing 64-bit counter be sufficient? (The
assumption that the system clock is monotonically increasing seems
pretty fragile, in the presence of manual system clock changes, ntpd,
etc.)


Yes, I do need it to be a timestamp, and one assumption is that all 
servers in the multimaster cluster are NTP-synchronized. The reason is 
that this is for asynchronous multimaster (in my case). Two sequences 
running on separate systems don't tell you which was the last update on a 
timeline. This conflict resolution method alone is of course completely 
inadequate.





Comments, changes, additions?


Would this feature have any use beyond the specific project/algorithm
you have in mind?


The tablelog project on pgfoundry currently uses the transaction's start 
time, but would be very delighted to have the commit time available instead.



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Tom Lane
Jan Wieck [EMAIL PROTECTED] writes:
 On 1/25/2007 6:47 PM, Neil Conway wrote:
 Would this feature have any use beyond the specific project/algorithm
 you have in mind?

 The tablelog project on pgfoundry currently uses the transactions start 
 time but would be very delighted to have the commit time available instead.

BTW, it's not clear to me why you need a new log area for this.  (We
don't log transaction start time anywhere, so certainly tablelog's needs
would not include it.)  Commit timestamps are available from WAL commit
records in a crash-and-restart scenario, so wouldn't that be enough?

regards, tom lane



Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Jan Wieck

On 1/25/2007 6:49 PM, Tom Lane wrote:

Jan Wieck [EMAIL PROTECTED] writes:
To provide this data, I would like to add another log directory, 
pg_tslog. The files in this directory will be similar to the clog, but 
contain arrays of timestamptz values.


Why should everybody be made to pay this overhead?


It could be made an initdb time option. If you intend to use a product 
that requires this feature, you will be willing to pay that price.
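For illustration, the clog-like pg_tslog layout quoted above might address a transaction's timestamp slot roughly like this. This is a hypothetical sketch only; the 8-byte slot size, 256 kB segment size, and file naming are assumptions for the example, not values from the proposal:

```python
# Hypothetical sketch of a clog-style pg_tslog mapping a transaction id
# to a fixed-size timestamp slot. Slot and segment sizes are assumed.

SLOT_SIZE = 8                        # bytes per timestamptz entry (assumed)
SEGMENT_SIZE = 256 * 1024            # bytes per segment file (assumed)
SLOTS_PER_SEGMENT = SEGMENT_SIZE // SLOT_SIZE

def tslog_location(xid: int) -> tuple:
    """Return (segment file name, byte offset) holding xid's commit timestamp."""
    segment = xid // SLOTS_PER_SEGMENT
    offset = (xid % SLOTS_PER_SEGMENT) * SLOT_SIZE
    return ("%04X" % segment, offset)
```

Because each xid maps to a fixed-width slot, a lookup is a single seek, the same property that makes the clog cheap to consult.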





The COMMIT syntax will get extended to
COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];
The extension is limited to superusers and will override the normally 
generated commit timestamp. This will be used to give the replicating 
transaction on the replica the exact same timestamp it got on the 
originating master node.


I'm not convinced you've even thought this through.  If you do that then
you have no guarantee of commit timestamp monotonicity on the slave
(if it has either multi masters or any locally generated transactions).
Since this is supposedly for a multi-master system, that seems a rather
fatal objection --- no node in the system will actually have commit
timestamp monotonicity.  What are you hoping to accomplish with this?


Maybe I wasn't clear enough about this. If the commit timestamps on the 
local machine are guaranteed to advance by at least one millisecond per 
commit (which admittedly limits the system to a sustained 1000 commits 
per second before it starts running ahead of real time), then no two 
commits on the same instance will ever carry the same timestamp. If, 
furthermore, each instance in the cluster has a distinct priority, 
encoded as the microsecond part added to the millisecond-truncated 
timestamp, then each commit timestamp can even act as a globally unique 
ID. This does require that all nodes in the cluster are configured with 
distinct priorities.
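A minimal sketch of that scheme (illustrative Python, not the proposal's code; the 0-999 priority range and the integer-microsecond representation are assumptions made for the example):

```python
# Sketch of the commit-timestamp scheme described above: truncate to
# milliseconds, force each commit to advance by at least 1 ms, and encode
# a per-node priority in the microsecond digits, making every value a
# globally unique, roughly ascending ID. Illustrative assumptions only.

class CommitTimestampGenerator:
    def __init__(self, node_priority: int, clock):
        assert 0 <= node_priority < 1000   # distinct per node (assumed range)
        self.node_priority = node_priority
        self.clock = clock                 # returns current time in microseconds
        self.last_ms = 0

    def next_commit_timestamp(self) -> int:
        """Return a commit timestamp in microseconds since epoch."""
        now_ms = self.clock() // 1000
        # Advance by at least 1 ms even if the clock has not moved, so no
        # two local commits ever share a timestamp.
        self.last_ms = max(now_ms, self.last_ms + 1)
        return self.last_ms * 1000 + self.node_priority
```

With distinct priorities, two nodes committing within the same millisecond still produce different values, which is what makes the timestamp usable as a global ID.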


What I hope to accomplish with this is a very easy, commit-time-based 
last-update-wins conflict resolution for data fields of the overwrite 
nature.
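Given globally unique commit timestamps, the resolver for overwrite-style fields is trivial. A hypothetical sketch (the function and its arguments are illustrative, not the proposal's API):

```python
# Last-update-wins resolution for overwrite-style fields, assuming each
# replicated change carries its originating commit timestamp and that
# timestamps are globally unique, so ties cannot occur.

def resolve_overwrite(local_ts, local_value, remote_ts, remote_value):
    """Keep whichever value carries the later commit timestamp."""
    return remote_value if remote_ts > local_ts else local_value
```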


The replication system I have in mind will have another field type, of 
the balance nature, for which it will never communicate the current 
value but only deltas, which get applied regardless of the two 
timestamps.
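The distinction matters because deltas commute while overwrites do not. An illustrative sketch of the balance-style merge (not the proposal's code):

```python
# "Balance" fields replicate deltas, not current values. Because addition
# commutes, deltas can be applied in any order, so commit timestamps need
# not be compared at all for this field type. Illustrative sketch.

def apply_delta(current_balance: float, delta: float) -> float:
    """Apply one replicated delta to the locally stored balance."""
    return current_balance + delta
```

Applying +10 then -3 yields the same balance as -3 then +10, regardless of which replica's transaction committed first.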



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Jan Wieck

On 1/25/2007 7:41 PM, Tom Lane wrote:

Jan Wieck [EMAIL PROTECTED] writes:

On 1/25/2007 6:47 PM, Neil Conway wrote:

Would this feature have any use beyond the specific project/algorithm
you have in mind?


The tablelog project on pgfoundry currently uses the transaction's start 
time, but would be delighted to have the commit time available instead.


BTW, it's not clear to me why you need a new log area for this.  (We
don't log transaction start time anywhere, so certainly tablelog's needs
would not include it.)  Commit timestamps are available from WAL commit
records in a crash-and-restart scenario, so wouldn't that be enough?


First, I need the timestamp of the original transaction that caused the 
data to change, which can be a remote or a local transaction. So the 
timestamp currently recorded in the WAL commit record is useless for 
this purpose, and the commit record would have to be extended by one 
more timestamp.


Second, I don't think an API that scans the WAL for commit records by 
xid would be efficient enough to satisfy the needs of a timestamp-based 
conflict resolution system, which has to retrieve the timestamp for 
every row's xmin that it is about to update in order to determine 
whether the old or the new values should be used.


Third, keeping the timestamp information in the WAL alone would require 
keeping the WAL segments around until they are older than the 
admin-chosen minimum freeze age. I hope you don't want to force that 
penalty on everyone who intends to use multimaster replication.



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Richard Troy

On Thu, 25 Jan 2007, Jan Wieck wrote:

 For a future multimaster replication system, I will need a couple of
 features in the PostgreSQL server itself. I will submit separate
 proposals per feature so that discussions can be kept focused on one
 feature per thread.

Hmm... "will need"... Have you prototyped this system yet? ISTM you can
prototype your proposal using external components so you can work out
the kinks first.

Richard


-- 
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/




Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Jan Wieck

On 1/25/2007 8:42 PM, Richard Troy wrote:

On Thu, 25 Jan 2007, Jan Wieck wrote:


For a future multimaster replication system, I will need a couple of
features in the PostgreSQL server itself. I will submit separate
proposals per feature so that discussions can be kept focused on one
feature per thread.


Hmm... "will need"... Have you prototyped this system yet? ISTM you can
prototype your proposal using external components so you can work out
the kinks first.


These details are pretty well drilled down and are needed with the 
described functionality. And I will not repeat the mistake I made with 
Slony-I of developing things that require backend support as totally 
external components (look at the catalog corruption mess I created 
there and you know what I'm talking about).



Jan




Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Bruce Momjian
Jan Wieck wrote:
 On 1/25/2007 6:49 PM, Tom Lane wrote:
  Jan Wieck [EMAIL PROTECTED] writes:
  To provide this data, I would like to add another log directory, 
  pg_tslog. The files in this directory will be similar to the clog, but 
  contain arrays of timestamptz values.
  
  Why should everybody be made to pay this overhead?
 
 It could be made an initdb time option. If you intend to use a product 
 that requires this feature, you will be willing to pay that price.

That is going to cut your usage by like 80%.  There must be a better
way.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Proposal: Commit timestamp

2007-01-25 Thread Jan Wieck

On 1/25/2007 11:41 PM, Bruce Momjian wrote:

Jan Wieck wrote:

On 1/25/2007 6:49 PM, Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
 To provide this data, I would like to add another log directory, 
 pg_tslog. The files in this directory will be similar to the clog, but 
 contain arrays of timestamptz values.
 
 Why should everybody be made to pay this overhead?


It could be made an initdb time option. If you intend to use a product 
that requires this feature, you will be willing to pay that price.


That is going to cut your usage by like 80%.  There must be a better
way.


I'd love to.

But it is a datum that needs to be collected at the moment the clog 
entry is made ... I don't think any external module can ever do that.


You know how long I've been in and out and back into replication again. 
The one question that pops up again and again in all the scenarios is 
"what the heck was the commit order?". Now, the pure commit order for a 
single node could certainly be recorded from a sequence, but that 
doesn't cover the multi-node environment I am after. That's why I want 
it to be a timestamp with a few fudged bits at the end. If you look at 
what I've described, you will notice that as long as all node priorities 
are unique, this timestamp will be a globally unique ID in a somewhat 
ascending order along a timeline. That is what replication people are 
looking for.


Tom fears that the overhead is significant, which I do understand and, 
frankly, wonder about myself (actually I don't even have a vague 
estimate). I really think we should make this thing an initdb option and 
decide later whether it's on or off by default. We could probably even 
implement it in a way that lets it be turned on and off, with a 
postmaster restart plus waiting out the desired freeze delay being 
sufficient.


What I know for certain is that no async replication system can ever do 
without the commit timestamp information. Using the transaction start 
time, or even each statement's timeofday, will only lead to 
inconsistencies all over the place (I haven't spent the past couple of 
months absent from the mailing lists hiding in my closet ... I've been 
experimenting and trying to get around all these issues - in my closet). 
Slony-I can survive without that information because everything happens 
on one node and we record snapshot information for later abuse. But look 
at what cost we deal with this rather trivial issue. All we need to know 
is the serializable commit order, and we have to issue queries that 
might eventually exceed address space limits?



Jan

