Re: [HACKERS] GSoC - proposal - Materialized Views in PostgreSQL

2010-04-11 Thread Florian G. Pflug

On 11.04.10 20:47 , Robert Haas wrote:

On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas
  wrote:

Robert Haas wrote:

2010/4/10 Andrew Dunstan:

Heikki Linnakangas wrote:

1. Keep the materialized view up-to-date when the base tables
change. This can be further divided into many steps, you can
begin by supporting automatic updates only on very simple
views with e.g a single table and a where clause. Then extend
that to support joins, aggregates, subqueries etc. Keeping it
really limited, you could even require the user to write the
required triggers himself.

That last bit doesn't strike me as much of an advance. Isn't
the whole point of this to automate it? Creating greedy
materialized views is usually not terribly difficult now, but
you do have to write the triggers.


Yeah, I agree.


It doesn't accomplish anything interesting on its own. But if you
do the planner changes to automatically use the materialized view
to satisfy queries (item 2. in my previous email), it's useful.


But you can't do that with a snapshot view, only a continuously updated
one.


If continuous updates prove to be too hard initially, you could instead
update the view on select if it's outdated. Such a materialized view
would be a kind of inter-session cache for subselects.

The hard part would probably be to figure out how to decide whether the
view is outdated or not, and to deal with two concurrent transactions
trying to use an outdated view (and both trying to refresh it). What
makes the second problem hard is that you wouldn't want one of the
transactions to wait for the other to complete, because this is not how
SELECTs traditionally behave.

best regards, Florian Pflug





Re: [HACKERS] Project suggestion: benchmark utility for PostgreSQL

2007-03-17 Thread Florian G. Pflug

Mickael DELOISON wrote:

For a programming language, as it would be for GSoC, it has to be
realized in three month and I believe the utility has to be
cross-platform (anyway I want it to be). So I think Java would be
good. I am very used to Java and Swing programming. What do you think
about that choice? If you feel Java is a bad choice, there is
C++/Boost/wxWidget/ (like pgAdmin). But with wxWidget, I am not sure
if a GUI works under Windows and Linux it will work under MacOS
without hacks.


I just wanted to comment on the wxWidgets-under-OSX part. It works
pretty well, but of course there are some bugs in wxWidgets that
plague only a particular platform. And since OSX seems to be one
of the lesser-used ones, there appear to be more bugs of this kind.

But if you grep through the pgAdmin3 sources, you'll find only a
little OSX-specific code.

One nice thing about wxWidgets is that applications that use it
feel like native applications on the platforms that wxWidgets
supports. Java GUIs often feel rather alien, at least IMHO.

greetings, Florian Pflug




Re: [HACKERS] modifying the tbale function

2007-03-19 Thread Florian G. Pflug

Gregory Stark wrote:

"Andrew Dunstan" <[EMAIL PROTECTED]> writes:


Yes, more or less. That's what made me think of it.

OTOH, before we rush out and do it someone needs to show that it's a net win. I
agree with Tom that making tuplestore faster would probably be a much better
investment of time.


I don't think the problem with the tuplestore is a matter of speed. It's a
matter of scalability and flexibility. It limits the types of applications
that can use SRFs and the amount of data they can manipulate before it becomes
impractical.

Consider applications like dblink that have SRFs that read data from slow
network sources. Or that generate more data than the server can actually store
at any one time. Or that overflow work_mem but are used in queries that could
return quickly based on the first few records.

Unfortunately, I don't think there's a simple fix that'll work for all PLs
using the current interface. Even languages with iterators themselves (python,
I think) probably don't expect to be called externally while an iterator is in
progress.


Just a thought - I believe that there are portable user-space thread
implementations that contain little or no machine-specific code. What
if postgres used one of those to switch from the PL into the executor
and back after, say, 1000 rows were returned by the SRF?

What would be needed is basically some enhanced version of setjmp/longjmp
that actually saves the stack, and not just resets the stack pointer.

Since context switching would occur only at two well-defined places
(some return_next_row function that PLs call when an SRF returns a row,
and in the executor if no more previously returned rows from that SRF
are available), this wouldn't introduce the usual multithreading
headache, but would still allow switching in and out of the PL interpreter.
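
To make that control flow concrete, here is a minimal, self-contained sketch
of such cooperative switching using the POSIX ucontext routines
(getcontext/makecontext/swapcontext). The names return_next_row and
pl_function are purely illustrative stand-ins, not PostgreSQL APIs; real code
would need error handling and a properly guarded stack, and the ucontext
family is obsolescent in POSIX and may need feature-test macros on some
platforms.

#include <stdio.h>
#include <ucontext.h>

static ucontext_t executor_ctx;   /* the "executor" side */
static ucontext_t pl_ctx;         /* the "PL interpreter" side */
static int current_row;
static int rows_exhausted;

/* What a PL would call for each produced row: hand the row over and
   suspend ourselves until the executor asks for more. */
static void
return_next_row(int row)
{
    current_row = row;
    swapcontext(&pl_ctx, &executor_ctx);
}

/* Stand-in for a set-returning PL function. */
static void
pl_function(void)
{
    int i;

    for (i = 1; i <= 5; i++)
        return_next_row(i * 10);
    rows_exhausted = 1;
}

int
main(void)
{
    static char pl_stack[64 * 1024];    /* alternate stack for the PL side */

    getcontext(&pl_ctx);
    pl_ctx.uc_stack.ss_sp = pl_stack;
    pl_ctx.uc_stack.ss_size = sizeof(pl_stack);
    pl_ctx.uc_link = &executor_ctx;     /* resume the executor when pl_function returns */
    makecontext(&pl_ctx, pl_function, 0);

    for (;;)
    {
        /* the "executor" pulls the next row by resuming the PL context */
        swapcontext(&executor_ctx, &pl_ctx);
        if (rows_exhausted)
            break;
        printf("executor consumed row %d\n", current_row);
    }
    return 0;
}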

greetings, Florian Pflug



Re: [HACKERS] modifying the tbale function

2007-03-19 Thread Florian G. Pflug

Andrew Dunstan wrote:

Florian G. Pflug wrote:
Just a thought - I believe that there are portable user-space thread
implementations that contain little or no machine-specific code. What
if postgres used one of those to switch from the PL into the executor
and back after, say, 1000 rows were returned by the SRF?

What would be needed is basically some enhanced version of setjmp/longjmp
that actually saves the stack, and not just resets the stack pointer.

Since context switching would occur only at two well-defined places
(some return_next_row function that PLs call when an SRF returns a row,
and in the executor if no more previously returned rows from that SRF
are available), this wouldn't introduce the usual multithreading
headache, but would still allow switching in and out of the PL interpreter.



This just sounds horribly fragile.


Why would it be? It's about the same as running postgresql in one thread,
and some PL in another. This should only cause trouble if both use some
non-reentrant libc-functions. But even that wouldn't matter because of
the well-defined context switching points.

Here is a paper about portable userspace threads that I just googled.
http://www.gnu.org/software/pth/rse-pmt.ps


Are we really sure that this isn't a solution in search of a problem?

I think this really depends on how you define "problem". Some people
might think that "select * from myfunc(...) limit 1" should stop and
return a result after myfunc(...) has returned one row. Others will
say "well, just use a different software design that doesn't depend
on this optimization".

greetings, Florian Pflug





Re: [HACKERS] modifying the tbale function

2007-03-19 Thread Florian G. Pflug

Gregory Stark wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:


Since context switching would occur only at two well-defined places
(some return_next_row function that PLs call when an SRF returns a row,
and in the executor if no more previously returned rows from that SRF
are available), this wouldn't introduce the usual multithreading
headache...


Yes it would. Consider what happens if the PL function calls into SPI to
execute a query


I don't think that would cause trouble. Postgres and the PL would still not run
concurrently - in fact, from a control-flow point of view the PL would run at
_exactly_ the same place as it does now: precisely when someone tries to pull
rows out of some executor node. The only difference is that its execution would
be stopped and restarted multiple times by some sort of stack switching.

There might be trouble if a second function has to be executed with the same PL
as an already running (but currently "stopped") function. That would only work
for PLs that are thread-safe in some way.

greetings, Florian Pflug




Re: [HACKERS] modifying the tbale function

2007-03-20 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Here is a paper about portable userspace threads that I just googled.
http://www.gnu.org/software/pth/rse-pmt.ps


I'm not impressed --- the list of unsolved problems is a bit too long.
One that seems a showstopper for our purposes is lack of ability to
deal reliably with stack overflow on alternate stacks.  If we're going
to be doing anything as loosely defined as running a third-party
language interpreter in these stacks, that just won't do.

Another little problem is it's LGPL.


I don't think using that GnuPT library the paper mentions is a viable
approach for postgres. I just posted that link because it shows
that this is not impossible to do.

What would actually be interesting is to find out whether the
ucontext/getcontext/makecontext/swapcontext/setcontext
functions mentioned in the paper are available on the
platforms where postgres is used in production for larger
datasets. I'd assume that people who need this PL optimization
will run a fairly up-to-date version of their particular OS,
so maybe the portability problems of those functions wouldn't
be a problem for postgres - the PL optimization would just be
disabled at configure time if they are not available.
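
A configure-time probe for those functions could be as small as the following
stand-alone program (exit status 0 meaning "usable"). This is only an
illustration of the idea, not an actual PostgreSQL configure test:

#include <ucontext.h>

static ucontext_t main_ctx;
static ucontext_t probe_ctx;
static volatile int reached;

static void
probe_func(void)
{
    reached = 1;
    /* returning here resumes main_ctx via uc_link */
}

int
main(void)
{
    static char stack[16 * 1024];

    if (getcontext(&probe_ctx) != 0)
        return 1;
    probe_ctx.uc_stack.ss_sp = stack;
    probe_ctx.uc_stack.ss_size = sizeof(stack);
    probe_ctx.uc_link = &main_ctx;
    makecontext(&probe_ctx, probe_func, 0);
    if (swapcontext(&main_ctx, &probe_ctx) != 0)
        return 1;
    return reached ? 0 : 1;
}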

The main question is: IF the stability problems like stack overflow
can be addressed, would this in principle be considered a feature
that people would like to have? Or is it considered not worth the effort?

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-21 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/21/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:


On 3/21/07, Pavan Deolasee <[EMAIL PROTECTED]> wrote:
> It seems much simpler to me do something like this. But important
> question is whether the restriction that CREATE INDEX can not
> be run in a transaction block is acceptable ?

yikes -- this is huge, huge price to pay, IMHO.  Think about DDL that
implies index creation such as adding unique constraint to a table,
many postgresql users (including me) take advantage of that in update
systems to production databases.


I didn't understand that quite well. How does it help to run CREATE
INDEX inside a transaction ?

The problem is not so much CREATE INDEX per se, but other DDL commands
that implicitly create an index, for example adding a PRIMARY KEY to a
table.

Some applications that I have written would fail badly if CREATE INDEX
was disallowed inside a transaction - mostly, because they use plpgsql
functions to manipulate database objects, and disallowing CREATE INDEX
inside a transaction presumably means disallowing it from inside
stored procedures.

greetings, Florian Pflug



Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-23 Thread Florian G. Pflug

Pavan Deolasee wrote:

There is a slight hole in that SERIALIZABLE transactions won't be able
to use any indexes they build during their transaction, since they may
need to be able to see prior data, but I don't think anybody is going to
complain about that restriction. Anyone?


Oh, I did not see that. If there are no HOT-chains in the table, we can
set the xid to InvalidTransactionId so that the index is usable immediately
after it is created in the current transaction, as well as in those
transactions which were started before CREATE INDEX. We can possibly further
improve it by checking that there are no HOT-chains except those created
by this transaction, and set xid to InvalidTransactionId. IMO with that we
shall address most of the use cases. There are a few which might
still get impacted, but even for them there won't be any correctness
problem.


Why exactly can't a SERIALIZABLE transaction use the index it created
itself? If you add a pointer to the root of all HOT update chains where
either the HEAD is alive, or some tuple is visible to the transaction
creating the index, shouldn't this be sufficient for using the index
in the creating transaction?

greetings, Florian Pflug




Re: [HACKERS] tsearch_core for inclusion

2007-03-23 Thread Florian G. Pflug

Teodor Sigaev wrote:
For a given schema and server locale, it's possible to have several FTS
configurations, but only one (with a special flag enabled) can be used as
the default. The current (active) FTS configuration is contained in the
GUC variable tsearch_conf_name. If it's not defined, then the FTS
configuration is looked up in the search_path, matching the server's locale
with the default flag enabled.


Isn't the real problem that only _one_ configuration per locale should
be marked as DEFAULT at any time, no matter what schema it is in?

Having one DEFAULT configuration per schema per locale will necessarily
cause confusion if search_path is not set carefully, I think.

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-23 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/23/07, Florian G. Pflug <[EMAIL PROTECTED]> wrote:



Why exactly can't a SERIALIZABLE transaction use the index it created
itself? If you add a pointer to the root of all HOT update chains where
either the HEAD is alive, or some tuple is visible to the transaction
creating the index, shouldn't this be sufficient for using the index
in the creating transaction?


Tuples in the HOT-chain may not share the same index keys with
respect to the new index being built (they share the same keys for
all existing indexes though). So we don't know which index key
to use while building the index.


Ah, of course - thanks for pointing that out.

greetings, Florian Pflug




Re: [HACKERS] tsearch_core for inclusion

2007-03-26 Thread Florian G. Pflug

Oleg Bartunov wrote:

On Fri, 23 Mar 2007, Florian G. Pflug wrote:


Teodor Sigaev wrote:
For a given schema and server locale, it's possible to have several FTS
configurations, but only one (with a special flag enabled) can be used as
the default. The current (active) FTS configuration is contained in the
GUC variable tsearch_conf_name. If it's not defined, then the FTS
configuration is looked up in the search_path, matching the server's locale
with the default flag enabled.


Isn't the real problem that only _one_ configuration per locale should
be marked as DEFAULT at any time, no matter what schema it is in?


I'm not sure I understand you correct (a bit complex :), but it's allowed
to have only _one_ DEFAULT configuration per schema/per locale. So,
visibility is defined by search_path for given locale.


Yes, but why is that needed? Wouldn't one DEFAULT configuration
per database be sufficient, and avoid the search_path problems?

Sorry if I'm being stupid - I just can't see what having a different
DEFAULT configuration per schema buys you.

greetings, Florian Pflug




Re: [HACKERS] Reduction in WAL for UPDATEs

2007-03-28 Thread Florian G. Pflug

Kenneth Marshall wrote:

We use DSPAM as one of our anti-spam options. Its UPDATE pattern is to
increment a spam counter or a not-spam counter while keeping the user and
token information the same. This would benefit from this optimization.
Currently we are forced to use MySQL with MyISAM tables to support the
update load, although PostgreSQL 8.2 performance is right at the I/O
break-even point for switching databases. With HOT and more optimized
UPDATE I/O, 8.3 would give us enough I/O headroom to switch to PostgreSQL.


Interesting. I've switched from MySQL to PostgreSQL for dspam, because
of concurrency issues with MyISAM which caused bad performance.

I am eager to see how much HOT speeds up my setup, though ;-)

BTW, the "COMMIT NOWAIT" feature Simon Riggs proposed should provide
a huge speedup too, since dspam runs one transaction for each token
it has to update.

greetings, Florian Pflug



Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-28 Thread Florian G. Pflug

Simon Riggs wrote:

On Wed, 2007-03-28 at 22:24 +0530, Pavan Deolasee wrote:

Just when I thought we had nailed down CREATE INDEX, I realized
that there is something more to worry about. The problem is with the
HOT-chains created by our own transaction which is creating the index.
We thought it would be enough to index the tuple at the head of the
chain, since that would be the visible copy once the transaction commits.
We thought of keeping the index unavailable for queries in pre-existing
transactions by setting a new "xid" attribute in pg_index. The question
is what value to assign to "xid". I thought we would assign
ReadNewTransactionId().

If you are indexing a table that hasn't just been created by you, set
the xcreate field on pg_index at the *end* of the build using
ReadNewTransactionId(). Any xid less than that sees the index as
invalid. If you created the table in this transaction (i.e.
createSubId != 0) then set xcreate to creating xid.


Couldn't you store the creating transaction's xid in pg_index, and
let other transactions check that against their snapshot like they
would for any tuple's xmin or xmax? (With one exception - the creating
transaction would consider indices it built itself invalid, which
is not how things usually work for xmin/xmax.)

This would mean that any transaction that believes that the creating
transaction has committed also considers the index to be valid.

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-28 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Couldn't you store the creating transaction's xid in pg_index, and
let other transaction check that against their snapshot like they
would for any tuple's xmin or xmax?


What snapshot?  I keep having to remind people that system catalog
operations are SnapshotNow.  In the particular context here, the
place where doing something else would break down is that the planner
has no idea when it makes a plan what snapshot(s) the plan might later
be used with.


Sorry - now that you say it, I remember that you've said that already
multiple times...

So the question is, why did this work until now, while CREATE INDEX+HOT
just doesn't seem to fit into this scheme?

I think the answer is that all other DDL statements manage to ensure
that any database objects they create or modify are usable for everybody
else immediately after they are committed. This usually implies pretty
strong locking requirements - for example, I think that the core reason
why TRUNCATE needs an exclusive lock is precisely the guarantee it has
to make.

Maybe this could somehow be relaxed? Could, for example, the planner
be allowed to base some of its decisions on the SerializableSnapshot
that every transaction (even a read-only one) possesses? It seems that
this would prevent plans from living longer than a transaction,
but maybe plan invalidation could help here?

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-28 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/28/07, Tom Lane <[EMAIL PROTECTED]> wrote:


"Florian G. Pflug" <[EMAIL PROTECTED]> writes:
> Couldn't you store the creating transaction's xid in pg_index, and
> let other transaction check that against their snapshot like they
> would for any tuple's xmin or xmax?

What snapshot?  I keep having to remind people that system catalog
operations are SnapshotNow.  In the particular context here, the
place where doing something else would break down is that the planner
has no idea when it makes a plan what snapshot(s) the plan might later
be used with.


Tom, please correct me if I am wrong. But ISTM that this idea might
work in this context. In get_relation_info(), we would check if "xcreate"
xid stored in pg_index for the index under consideration is seen
committed with respect to the snapshot at that point of time.
Even if the snapshot changes later and index becomes valid, we
might not replan and hence not use index. But that doesn't seem
like a big problem to me.


The problem with that is use cases like
PREPARE my_plan  ;
BEGIN;
EXECUTE my_plan  ;
COMMIT ;

Is that "PREPARE" even run inside a transaction? Even if it is, it
probably won't have created a snapshot...

I think allowing the use of some sort of snapshot from inside the planner
would allow some locking to be relaxed, but there seem to be a lot of
corner cases to consider :-(

OTOH, if you manage to make this work, a TRUNCATE that doesn't block
concurrent selects might become possible to do. This would for example
allow dropping and rebuilding subscriptions on a slony node while it
is in use.

greetings, Florian Pflug



Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Pavan Deolasee wrote:

In this specific context, this particular case is easy to handle because
we are only concerned about the serializable transactions started before
CREATE INDEX commits. If PREPARE can see the new index, it
implies that the CI transaction is committed. So the transaction
starting after that can only see the tuple version that we have indexed.


Yes, but the non-index plan PREPARE generated will be used until the end
of the session, not only until the end of the transaction. Imagine that
it wasn't explicitly PREPAREd (where you might say this is acceptable),
but rather just a query inside a plpgsql function, maybe even called
from some app using connection pooling. This means that the non-index-using
plan might get used for quite a long time, which contradicts the
work Tom did on plan invalidation, I think.

Maybe Tom can comment on whether it's possible to use plan invalidation
to eventually get rid of a stale plan in this context?

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/29/07, Florian G. Pflug <[EMAIL PROTECTED]> wrote:


Yes, but the non-index plan PREPARE generated will be used until the end
of the session, not only until the end of the transaction.


Frankly I don't know how this works, but are you sure that the plan will
be used until the end of the session? Even if that's the case, it can
happen even today if we create a new index, but the existing sessions
will use the stale plan (assuming what you said is true).


I've checked that:

test=# prepare myplan as select * from test where id=1 ;
PREPARE

test=# explain execute myplan ;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22897.70 rows=5421 width=36)
   Filter: (id = 1)
(2 rows)

 Now I create an index in another session 

test=# explain select * from test where id=1 ;
                              QUERY PLAN
----------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=95.11..8248.45 rows=5000 width=36)
   Recheck Cond: (id = 1)
   ->  Bitmap Index Scan on idx  (cost=0.00..93.86 rows=5000 width=0)
         Index Cond: (id = 1)
(4 rows)

test=# explain execute myplan ;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22897.70 rows=5421 width=36)
   Filter: (id = 1)
(2 rows)

!!! Index got used by the "select .. " but not by "execute myplan ..." !!!

test=# prepare myplan2 as select * from test where id=1 ;
PREPARE
test=# explain execute myplan2 ;
                            QUERY PLAN
------------------------------------------------------------------
 Index Scan using idx on test  (cost=0.00..8.38 rows=1 width=37)
   Index Cond: (id = 1)
(2 rows)

!!! A newly prepared plan of course uses the index !!!



So yes, plans get cached until the end of the session, and
yes, 8.2 won't notice index creation either ;-)

The open question is how CVS HEAD with plan invalidation behaves.
If it replans after the index-creating transaction commits, then
basing index validity on a snapshot will break this, because upon
replanning the index might not be usable yet, but later on it may very
well be (and the plan invalidation machinery won't realize that).

So this might not introduce a regression compared to 8.2, but it would
compared to a future 8.3 with plan invalidation...

Sorry for being so unclear in my previous emails - I had confused
myself ;-)

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Tom Lane wrote:

"Simon Riggs" <[EMAIL PROTECTED]> writes:

ISTM that the run-another-transaction-afterwards idea is the only one
that does everything I think we need. I really do wish we could put in a
wait, like CIC, but I just think it will break existing programs.


Actually, there's a showstopper objection to that: plain CREATE INDEX
has to be able to run within a larger transaction.  (To do otherwise
breaks "pg_dump --single-transaction", just for starters.)  This means
it can *not* commit partway through.


I believe the original idea was to invent some kind of "on commit, run
this transaction" hook - similar to how files are deleted on commit,
I think. At least that's how I understood the "run another transaction
on commit" idea...

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-29 Thread Florian G. Pflug

Simon Riggs wrote:

On Thu, 2007-03-29 at 17:27 -0400, Tom Lane wrote:

"Simon Riggs" <[EMAIL PROTECTED]> writes:

ISTM that the run-another-transaction-afterwards idea is the only one
that does everything I think we need. I really do wish we could put in a
wait, like CIC, but I just think it will break existing programs.

Actually, there's a showstopper objection to that: plain CREATE INDEX
has to be able to run within a larger transaction.  (To do otherwise
breaks "pg_dump --single-transaction", just for starters.)  This means
it can *not* commit partway through.


The idea is to make note that the transaction has created an index
within a transaction block, so that after the top level transaction
commits we sneak in an extra hidden transaction to update the pg_index
tuple with the xcreate of the first transaction. 


The only other alternative is to forcibly throw a relcache inval event
in the same circumstances without running the additional transaction,
but the solution is mostly the same.


I think one alternative might be to store a list of xid's together with
a cached plan, and replan if the commit status (as perceived by the
transaction the plan will be executed in) of one of those xid's changes.
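
A purely hypothetical sketch of that bookkeeping - none of these structure or
function names exist in PostgreSQL, and xid_seen_as_committed() merely stands
in for whatever snapshot-based visibility check the executing transaction
would really use:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* xids of index-creating transactions that were treated as uncommitted
   (and whose indexes were therefore ignored) when the plan was built */
typedef struct PlanXidWatch
{
    int           nxids;
    TransactionId xids[8];
} PlanXidWatch;

/* Placeholder for the real check: would the executing transaction's
   snapshot now consider this xid committed? */
static bool
xid_seen_as_committed(TransactionId xid)
{
    (void) xid;
    return false;
}

/* Replan if any transaction we ignored at plan time has since become
   visible - its index might now be usable. */
static bool
cached_plan_needs_replan(const PlanXidWatch *watch)
{
    int i;

    for (i = 0; i < watch->nxids; i++)
    {
        if (xid_seen_as_committed(watch->xids[i]))
            return true;
    }
    return false;
}

int
main(void)
{
    PlanXidWatch watch = { 1, { 1234 } };

    return cached_plan_needs_replan(&watch) ? 1 : 0;
}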

greetings, Florian Pflug




Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Tom Lane wrote:

"Pavan Deolasee" <[EMAIL PROTECTED]> writes:

How about storing the snapshot which we used during planning in
CachedPlanSource, if at least one index was seen unusable because
its CREATE INDEX transaction was seen as in-progress ?


I'm getting tired of repeating this, but: the planner doesn't use a
snapshot.  System catalogs run on SnapshotNow.


But it would still do that - it would just compare the createxid of
the index against some snapshot, and the query would be replanned
if the cached result of this comparison differs from the one the
current snapshot yields.

It might well be that this won't work, because the planner is invoked
in situations where there is no active snapshot - I'm not sure if your 
comment refers to that case, or not.


greetings, Florian Pflug



Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Tom Lane wrote:

"Pavan Deolasee" <[EMAIL PROTECTED]> writes:

What I am suggesting is to use ActiveSnapshot (actually
Florian's idea) to decide whether the transaction that created
index was still running when we started. Isn't it the case that
some snapshot will be "active" when we plan ?


I do not think you can assume that the plan won't be used later with
some older snapshot.  Consider recursive plpgsql functions for a
counterexample: the inner occurrence might be the first to arrive at
a given line of the function, hence the first to plan it, yet when we
return to the outer instance we might revert to an older snapshot.


So maybe we'd need to use the SerializableSnapshot created at the start
of each transaction for this check, and not the ActiveSnapshot? Could
that work?

What about doing
PREPARE myplan AS select ... ;
outside of a transaction? Will this be executed inside a transaction?
Is a query always planned upon its first execution, and not when
"PREPARE" is issued?

greetings, Florian Pflug





Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

I do not think you can assume that the plan won't be used later with
some older snapshot.



So maybe we'd need to use the SerializableSnapshot created at the start
of each transaction for this check, and not the ActiveSnapshot? Could
that work?


That might work, but it doesn't seem to address the core objection:
there's no mechanism to cause the query to be replanned once the
snapshot is new enough, because no relcache inval will happen.  So
most likely existing backends will keep using old plans that don't
consider the index.


Pavan suggested storing the IndexSnapshot in the cached plan, and
comparing it to the IndexSnapshot when the query is executed.

If those two snapshots differ, the query would be replanned.

My idea was to store a list of xid's together with the cached plan that
are assumed to be uncommitted according to the IndexSnapshot. The query
is replanned if upon execution the IndexSnapshot assumes that one of
these xid's is committed.

Those two ideas seem to be mostly equivalent; mine seems to be a bit
more fine-grained, but at the cost of more work upon each query execution.

greetings, Florian Pflug



Re: [HACKERS] CREATE INDEX and HOT - revised design

2007-03-30 Thread Florian G. Pflug

Pavan Deolasee wrote:

On 3/30/07, Florian G. Pflug <[EMAIL PROTECTED]> wrote:


My idea was to store a list of xid's together with the cached plan that
are assumed to be uncommitted according to the IndexSnapshot. The query
is replanned if upon execution the IndexSnapshot assumes that one of
these xid's is committed.



Actually, if we are using Serializable Snapshot then there is no chance
to replan the query before the transaction completes and the next
transaction to start in the session must see the index and hence
we must replan. So it would be enough just to associate a transaction
id with the cached plan. If this xid is set and our transaction id is
different than that, we replan.


I believe this is true for the CREATE INDEX scenario. However, comparing
either the snapshot or the result of xid checks seems like it might
be useful for other things besides CREATE INDEX. I'm specifically
thinking about TRUNCATE here - the CREATE INDEX + HOT problems sound
quite similar to the problems a non-exclusive-locking TRUNCATE would face.

greetings, Florian Pflug



Re: [HACKERS] Minor changes to Recovery related code

2007-03-30 Thread Florian G. Pflug

Simon Riggs wrote:

On Fri, 2007-03-30 at 16:34 -0400, Tom Lane wrote:

"Simon Riggs" <[EMAIL PROTECTED]> writes:

2. pg_stop_backup() should wait until all archive files are safely
archived before returning

Not sure I agree with that one.  If it fails, you can't tell whether the
action is done and it failed while waiting for the archiver, or if you
need to redo it.


There's a slight delay between pg_stop_backup() completing and the
archiver doing its stuff. Currently if somebody does a -m fast straight
after the pg_stop_backup() the backup may be unusable.

We need a way to plug that small hole.

I suggest that pg_stop_backup() polls once per second until
pg_xlog/archive_status/LOG.ready disappears, in which case it ends
successfully. If it does this for more than 60 seconds it ends
successfully but produces a WARNING.


I fear that ending successfully despite not having archived all WALs
will make this feature less worthwhile. If a DBA knows what he is
doing, he can code a perfectly safe backup script using 8.2 too.
He'll just have to check the current WAL position after pg_stop_backup()
(there is a function for that, right?) and wait until the corresponding
WAL file has been archived.

In reality, however, I fear that most people will just create a script
that does 'echo "select pg_stop_backup()" | psql' or something similar.
If they're a bit more careful, they will enable ON_ERROR_STOP and check
the return value of psql. I believe that those are the people who would
really benefit from a pg_stop_backup() that waits for archiving to complete.
But they probably won't check for WARNINGs.

Maybe doing it the other way round would be an option?
pg_stop_backup() could wait for the archiver to complete forever, but
spit out a warning every 60 seconds or so: "WARNING: Still waiting
for archiving of WAL file ??? to complete". If someone really wants
a 60-second timeout, he can just use statement_timeout.
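
As a stand-alone illustration of such a wait loop (not the proposed
server-side behaviour itself, which would live inside pg_stop_backup()), the
following sketch polls pg_xlog/archive_status for leftover .ready files and
warns once a minute; the data directory path is, of course, an assumption:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* count the *.ready entries the archiver has not yet processed */
static int
pending_ready_files(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    int n = 0;

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL)
    {
        size_t len = strlen(e->d_name);

        if (len > 6 && strcmp(e->d_name + len - 6, ".ready") == 0)
            n++;
    }
    closedir(d);
    return n;
}

int
main(void)
{
    const char *dir = "/var/lib/postgresql/data/pg_xlog/archive_status";
    int n;

    while ((n = pending_ready_files(dir)) > 0)
    {
        fprintf(stderr, "WARNING: still waiting for %d WAL file(s) to be archived\n", n);
        sleep(60);
    }
    return (n == 0) ? 0 : 1;
}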

Anyway, just my 0.02 eurocents - maybe I'm totally mistaken about the
PostgreSQL DBAs out there...

greetings, Florian Pflug



[HACKERS] Fate of pgsnmpd

2007-04-06 Thread Florian G. Pflug

Hi

Does anyone know if pgsnmpd is still actively developed?
The last version (0.1b1) is about 15 months old.

greetings, Florian Pflug



Re: [HACKERS] "select ('{}'::text[])[1]" returns NULL -- is it correct?

2007-04-10 Thread Florian G. Pflug

Nikolay Samokhvalov wrote:

On 4/10/07, Tom Lane <[EMAIL PROTECTED]> wrote:

"Nikolay Samokhvalov" <[EMAIL PROTECTED]> writes:
> I remember several cases when people (e.g. me :-) ) were spending some
> time trying to find an error in some pl/pgsql function and the reason
> lied in incorrect work with arrays (i.e. messages like "index is out
> of bounds" and "index cannot be negative number" would help, surely).

Well, if indexes *couldn't* be negative numbers then that might be
helpful, but they can.


Ooops :-) OK, my proposal is narrowing to very simple one: what about
triggering WARNINGs when user tries to access nonexistent element of
array?


Please don't ;-)
There are two sane options - return an error, or return NULL. Both are
sensible, and different programming languages make different choices.

The only reason for a WARNING would be a long-term plan to change the
existing behaviour. But this will cause lots of pain, for no real gain,
because no matter which behaviour you pick, there are always situations
where the other would be more convenient.

Just look at the mess PHP has created by altering fundamental aspects
of the language (4.4 -> 5.0).

greetings, Florian Pflug



Re: [HACKERS] [DOCS] uuid type not documented

2007-04-10 Thread Florian G. Pflug

Neil Conway wrote:

On Tue, 2007-04-10 at 18:28 +0200, Peter Eisentraut wrote:
The problem is that most of the standard methods are platform dependent, as 
they require MAC addresses or a "good" random source, for instance.


http://archives.postgresql.org/pgsql-patches/2007-01/msg00392.php

ISTM random() or similar sources is a sufficient PSRNG for the purposes
of UUID generation -- I can't see anything in the RFC that would
contradict that.


Maybe a short-term solution could be a UUID-generating function that
takes some kind of seed as a parameter. People not concerned about
collisions could just pass some random value, while others could use
the MAC address of the client or something similar.
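
Just to make the idea concrete, here is a deliberately simplistic,
self-contained sketch of such a function (not a proposal for the actual
backend code): it hashes the caller-supplied seed together with the current
time and formats the result as a version-4-style UUID. A real implementation
would want a far better entropy source than this:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* FNV-1a: a cheap way to turn the seed into pseudo-random bytes */
static uint64_t
fnv1a(const char *s, uint64_t h)
{
    while (*s)
    {
        h ^= (unsigned char) *s++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Fill buf (at least 37 bytes) with a version-4-style UUID derived from seed. */
static void
uuid_from_seed(const char *seed, char *buf)
{
    unsigned char b[16];
    uint64_t h1 = fnv1a(seed, 14695981039346656037ULL);
    uint64_t h2 = fnv1a(seed, h1 ^ (uint64_t) time(NULL));

    memcpy(b, &h1, 8);
    memcpy(b + 8, &h2, 8);
    b[6] = (b[6] & 0x0f) | 0x40;    /* version 4 */
    b[8] = (b[8] & 0x3f) | 0x80;    /* RFC 4122 variant */

    snprintf(buf, 37,
             "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-%02x%02x%02x%02x%02x%02x",
             b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
             b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
}

int
main(void)
{
    char uuid[37];

    uuid_from_seed("00:1f:3a:some-client-mac", uuid);
    printf("%s\n", uuid);
    return 0;
}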

greetings, Florian Pflug




[HACKERS] What tools do people use to hack on PostgreSQL?

2007-04-12 Thread Florian G. Pflug

Hi

I'm very excited that my project for implementing read-only queries
on PITR slaves was accepted for GSoC, and I'm now trying to work
out what tools I'll use for that job.

I'd like to be able to create some sort of branches and tags for
my own work (only inside my local repository of course).

I've considered using git, but I couldn't make the cvs->git gateway
work - neither using the postgresql CVS repository directly, nor with
a private copy obtained with CVSup.

There is also svk, but I think I'd need an SVN repo that mirrors
the postgresql CVS for that to work. I think Joshua Drake created
one once, but I don't know if it is kept up-to-date.

What do you guys use for your development work?

greetings, Florian Pflug




Re: [HACKERS] What tools do people use to hack on PostgreSQL?

2007-04-12 Thread Florian G. Pflug

Joshua D. Drake wrote:

Alexey Klyukin wrote:

Alvaro Herrera wrote:

But if you have a checked out tree, does it work to do an update after
the tree has been regenerated?  As far as I know, the repo is generated
completely every few hours, so it wouldn't surprise me that the checked
out copy is not compatible with the new repo.

I admit I haven't tried.
  

I have tried and svn up worked without issues.


As a note we will be updating this to subversion 1.4 shortly so people 
can do svnsync too.


Do I read this correctly as "This repository will stay around for a while,
and isn't just an experiment that might be stopped tomorrow"? If so, I'll
try using it - and lots of thanks for providing it.

greetings, Florian Pflug




[HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-15 Thread Florian G. Pflug

Hi

When I try to build CVS HEAD on OSX 10.4, compiling
src/interfaces/ecpg/preproc/preproc.c fails with:
In file included from preproc.y:6951:
pgc.l:3:20: error: config.h: No such file or directory
In file included from pgc.l:28,
 from preproc.y:6951:
preproc.h:996: error: conflicting types for 'base_yylloc'
y.tab.c:18673: error: previous declaration of 'base_yylloc' was here
In file included from preproc.y:6951:
pgc.l:43: error: 'MAX_PARSE_BUFFER' undeclared here (not in a function)

If I delete pgc.c, it is rebuilt automatically, and then
preproc.c compiles just fine.

I'm using gcc 4.0.1, flex 2.5.4 and bison 2.3

greetings, Florian Pflug



Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-15 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

When I try to build CVS HEAD on OSX 10.4, compiling
src/interfaces/ecpg/preproc/preproc.c fails with:
...
If I delete pgc.c, it is rebuilt automatically, and then
preproc.c compiles just fine.
...
I'm using gcc 4.0.1, flex 2.5.4 and bison 2.3


Perhaps you changed bison versions and didn't force a rebuild?
Those line numbers don't seem to sync up with my copies of the
derived files.


I just realized that this file isn't even in the postgresql CVS
repo. But it _is_ part of the SVN mirror at
https://projects.commandprompt.com/public/pgsql/repo.

The version that shows up in the trunk of the SVN repo is
the revision 1.5 from CVS
(http://developer.postgresql.org/cvsweb.cgi/pgsql/src/interfaces/ecpg/preproc/Attic/pgc.c?rev=1.5;content-type=text%2Fplain;hideattic=0)

This is the same as
https://projects.commandprompt.com/public/pgsql/repo/trunk/pgsql/src/interfaces/ecpg/preproc/pgc.c
modulo the expansion of the $Header macro.

Seems to be a bug in the CVS->SVN conversion process...

greetings, Florian Pflug




Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-16 Thread Florian G. Pflug

Alvaro Herrera wrote:

Ah, it seems the SVN repo just got its first user ;-)  Congratulations.
Ask Joshua to send you a Command Prompt tee shirt, maybe he is excited
enough.


I hope the fact that I use the SVN repo just to get the changes into
git doesn't reduce my chances of getting that t-shirt ;-)

greetings, Florian Pflug



Re: [HACKERS] Build-Problem with pgc.c on OSX 10.4

2007-04-16 Thread Florian G. Pflug

Alvaro Herrera wrote:

Florian G. Pflug wrote:

Alvaro Herrera wrote:

Ah, it seems the SVN repo just got its first user ;-)  Congratulations.
Ask Joshua to send you a Command Prompt tee shirt, maybe he is excited
enough.

I hope the fact that I use the SVN repo just to get the changes into
git doesn't reduce my chances of getting that t-shirt ;-)


Hum, why don't you just use the CVS directly then?  That'd avoid this
sort of infelicities.


git-cvsimport didn't work for me - neither with the main CVS repo, nor
with an rsync'ed copy.
It complained about all sorts of problems - I don't have enough CVS know-how
to judge if those were actual problems with the repo, or just deficiencies
of git-cvsimport. Plus I didn't find a way to import the current version of
HEAD as one revision, and only go incrementally from there. It always wanted
to mirror the whole history stored in CVS into my git repo, which is
overkill.

For SVN, there is git-svn, which does just what I want - I started with some
revision a few days ago, and it just incrementally imports updates from there
into a special branch of my git repo, and doesn't care about what happened
before that revision.

It's not all perfect, but I think for me it works better than just doing my
changes in a CVS checkout.

greetings, Florian Pflug



Re: [HACKERS] What tools do people use to hack on PostgreSQL?

2007-04-16 Thread Florian G. Pflug

Joshua D. Drake wrote:

http://projects.commandprompt.com/public/pgsql/browser

or do the anonymous checkout with:

svn co http://projects.commandprompt.com/public/pgsql/repo/


But if you have a checked out tree, does it work to do an update after
the tree has been regenerated?  As far as I know, the repo is generated
completely every few hours, so it wouldn't surprise me that the checked
out copy is not compatible with the new repo.

I admit I haven't tried.


Not a clue. Anyone try it yet?


git-svn seems to work fine against the SVN repo, apart from the problem with
the files deleted in CVS which still show up in SVN.

It's only been running for about two days, though...

greetings, Florian Pflug



Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-16 Thread Florian G. Pflug

Martin Langhoff wrote:

Hi Florian,

I am right now running an rsync of the Pg CVS repo to my work machine to
get a git import underway. I'm rather keen on seeing your cool PITR Pg
project go well and I have some git+cvs fu I can apply here (being one
of the git-cvsimport maintainers) ;-)

Cool - I'm new to git, so I really appreciate any help that I can get.


For the kind of work you'll be doing (writing patches that you'll want
to be rebasing onto the latest HEAD for merging later) git is probably
the best tool. That's what I use it for... tracking my experimental /
custom branches of projects that use CVS or SVN :-)

That's how I figured I'd work - though I don't yet understand what
the advantage of "rebase" is over "merge".

Currently, I've set up a git repo that pulls in the changes from the SVN
repo and pushes them to my main SoC git repo. On that main repo I have
two branches, master and pgsql-head, and I call "cg-merge pgsql-head"
if I want to merge with CVS HEAD.


Initially, I'll post it on http://git.catalyst.net.nz/ and I can run a
daily import for you - once that's in place you can probably get a repo
with your work on http://repo.or.cz/

Having a git mirror of the pgsql CVS would be great.
BTW, I've just checked out repo.or.cz, and noticed that there is already a
git mirror of the pgsql CVS: http://repo.or.cz/w/PostgreSQL.git

greetings + thanks
Florian Pflug



Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-16 Thread Florian G. Pflug

Aidan Van Dyk wrote:

Martin Langhoff wrote:
Well, now that more than one of us are working with git on PostgreSQL...

I've had a repo conversion running for a while...  I've only got it to what
I consider "stable" last week:
http://repo.or.cz/w/PostgreSQL.git
git://repo.or.cz/PostgreSQL.git

Ah - thats what I just stumbled over ;-)


For those interested in the conversion process, I've used a slightly
modified version of fromcvs (a Ruby cvs-to-git/Hg tool), and it runs on all
of pgsql in about 20 minutes.

I gave up on git-svn (because of both speed and my inability to easily
"filter" out Keywords, etc.) and git-cvsimport (because cvsps doesn't
seem to like pgsql's repo).

Yeah, git-cvsimport didn't work for me either...

I "update" the git repo daily, based on an anonymous rsync of the cvsroot. 
If the anon-rsync is updated much more frequently, and people think my git

conversion should match it, I have no problem having cron run it more than
daily.

Also - note that I give *no* guarantees of its integrity, etc.

I've "diffed" a CVS checkout and a git checkout, and the are *almost*
identical.  Almost, because it seems like my git repository currently has 3
files that a cvs checkout doesn't:
 backend/parser/gram.c |12088 +++
 interfaces/ecpg/preproc/pgc.c | 2887 ++
 interfaces/ecpg/preproc/preproc.c |16988 ++

And at this point, I haven't been bothered to see where those files came
from (and where they disappear) in CVS and why my import isn't picking that
up...  I could probably be pushed if others find this repo really useful,
but those files problematic...

That's interesting - the SVN mirror of the pgsql CVS at
http://projects.commandprompt.com/public/pgsql/browser
has exactly the same problem with those 3 files, as I found out the hard way ;-)

In the case of pgc.c, I've compared the revision in CVS with the one in
SVN. SVN includes the CVS version 1.5 of this file in trunk, which seems to
be the last version of that file in CVS HEAD. Interestingly,
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/interfaces/ecpg/preproc/Attic/pgc.c
shows no trace of the file being deleted from HEAD either - it just shows
that it was removed from WIN32_DEV. But still a CVS checkout doesn't include 
that file...


Since 3 tools (cvsweb, git-cvsimport and whatever commandprompt uses to create
the SVN mirror) all come to the same conclusion regarding this file, I think
that this is caused by some corruption of the CVS repository - but I don't have
the cvs-fu to debug this...

greetings, Florian Pflug




Re: [HACKERS] Re: IDENTITY/GENERATED v36 Re: [PATCHES] Final version of IDENTITY/GENERATED patch

2007-04-16 Thread Florian G. Pflug

Zoltan Boszormenyi wrote:

Tom Lane wrote:

Zoltan Boszormenyi <[EMAIL PROTECTED]> writes:

Also, the current grammar is made to give a syntax error
if you say "colname type GENERATED BY DEFAULT AS ( expr )".
But it makes the grammar unbalanced, and gives me:
bison -y -d  gram.y
conflicts: 2 shift/reduce


It's been quite a while since I last used bison, but as far as I
remember, you can tell it to write a rather detailed log about
its analysis of the grammar. That log should include more
detailed information about those conflicts - maybe that helps
to figure out their exact cause, and to find a workaround.

greetings, Florian Pflug




Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-16 Thread Florian G. Pflug

Tom Lane wrote:

Alvaro Herrera <[EMAIL PROTECTED]> writes:

These files are generated (from gram.y, pgc.l and preproc.y
respectievly) and are not present in the CVS repo, though I think they
have been at some point.



It's strange that other generated files (that have also been in the repo
in the past) like preproc.h are not showing up.


The weird thing about these files is that the CVS history shows commits
on HEAD later than the file removal commit.  I don't recall if Vadim
unintentionally re-added the files before making those commits ... but
if he did, you'd think it'd have taken another explicit removal to get
rid of them in HEAD.  More likely, there was some problem in his local
tree that allowed a "cvs commit" to think it should update the
repository with copies of the derived files he happened to have.

I think this is a corner case that CVS handles in a particular way and
the tools people are using to read the repository handle in a different
way.  Which would be a bug in those tools, since CVS's interpretation
must be right by definition.


The question is whether it'd be acceptable to manually remove that last commit
from the repository. I guess simply re-adding, and then removing the files
again, should do the trick, though it'd be cleaner to remove the
offending commit in the first place. Should postgres ever decide to switch
to another version control system (which I don't advocate), that'd be
one less obstacle to deal with...

Or is the risk of causing breakage too high?

greetings, Florian Pflug



Re: [HACKERS] RESET command seems pretty disjointed now

2007-04-16 Thread Florian G. Pflug

Tom Lane wrote:

Mark Kirkwood <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

The current documentation for RESET exhibits a certain lack of, um,
intellectual cohesiveness:

Synopsis

RESET configuration_parameter
RESET ALL
RESET { PLANS | SESSION | TEMP | TEMPORARY }


Maybe DISCARD for the plans etc might be more intuitive than extending 
RESET?


DISCARD PLANS and DISCARD TEMP seem pretty reasonable, but DISCARD SESSION
sounds a bit odd --- it seems like it might mean "disconnect", which of
course is exactly what we're trying to avoid.  But possibly we could
rename RESET SESSION as DISCARD ALL.

Leastwise I haven't got any better ideas.  Anyone have another proposal?


What about
RESET parameter
RESET { PLANS | TEMP | TEMPORARY }
RESET ALL { PARAMETERS | STATE }

RESET ALL would become an abbreviation of RESET ALL PARAMETERS (for backwards
compatibility), while RESET SESSION would be renamed to RESET ALL STATE.

greetings, Florian Pflug




Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-17 Thread Florian G. Pflug

Martin Langhoff wrote:

Aidan Van Dyk wrote:

And remember the warning I gave that my conversion is *not* a direct CVS
import - I intentionally *unexpand* all Keywords before stuffing them
into GIT so that merging and branching can ignore all the Keyword
conflicts... 


My import is unexpanding those as well to support rebasing and merging
better.

So - if you are committed to providing your gateway long term to
Florian, I'm happy to drop my gateway in favour of yours.


There seem to be other people besides me who are interested in a git
mirror. Maybe we could declare one of those mirrors the
"official" one - I guess things would be easier if all people
interested in using git would use the same mirror...

What do you guys think?


(Florian, before basing your code on either you should get a checkout of
Aidan's and mine and check that the tips of the branches you are working
on match the cvs branches -- the cvsimport code is good but whereever
CVS is involved, there's a lot of interpretation at play, a sanity check
is always good).

I actually hoped that I could just take my current git repo, and rebase
my branch onto one of those two repos - or does rebase only work from
an ancestor to a descendant?

greetings, Florian Pflug



[HACKERS] [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

2007-04-19 Thread Florian G. Pflug

Hi

I believe I have discovered the following problem in pgsql 8.2 and HEAD,
concerning warm standbys using WAL log shipping.

The problem is that after a crash, the master might complete incomplete
actions via rm_cleanup() - but since it won't WAL-log those changes,
the slave won't know about this. This will at least prevent the creation
of any further restart points on the slave (because safe_restartpoint()
will never return true again); it might even cause data corruption,
if subsequent WAL records are interpreted wrongly by the slave because
it sees different data than the master did when it generated them.

Attached is a patch that lets RecoveryRestartPoint call all
rm_cleanup() methods and create a restart point whenever it encounters
a shutdown checkpoint in the WAL (because those are generated after
recovery). This ought not to cause a performance degradation, because
shutdown checkpoints occur very infrequently.

The patch is per discussion with Simon Riggs.

I've not yet had a chance to test this patch; I only made sure
that it compiles. I'm sending this out now because I hope this
might make it into 8.2.4.

greetings, Florian Pflug
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
 		 * Perform a checkpoint to update all our recovery activity to disk.
 		 *
 		 * Note that we write a shutdown checkpoint rather than an on-line
-		 * one. This is not particularly critical, but since we may be
-		 * assigning a new TLI, using a shutdown checkpoint allows us to have
-		 * the rule that TLI only changes in shutdown checkpoints, which
-		 * allows some extra error checking in xlog_redo.
+		 * one. A slave will always create a restart point if it sees a
+		 * shutdown checkpoint, and will call all rm_cleanup() methods before
+		 * it does so. This guarantees that any actions taken by the master
+		 * in rm_cleanup will also be carried out on the slave.
+		 * Additionally, we may be assigning a new TLI, so using a shutdown
+		 * checkpoint allows us to have the rule that TLI only changes in shutdown
+		 * checkpoints, which allows some extra error checking in xlog_redo.
 		 */
 		CreateCheckPoint(true, true);
 
@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
  * restartpoint is needed or not.
  */
 static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
 {
 	int			elapsed_secs;
 	int			rmid;
 
 	/*
-	 * Do nothing if the elapsed time since the last restartpoint is less than
-	 * half of checkpoint_timeout.	(We use a value less than
-	 * checkpoint_timeout so that variations in the timing of checkpoints on
-	 * the master, or speed of transmission of WAL segments to a slave, won't
-	 * make the slave skip a restartpoint once it's synced with the master.)
-	 * Checking true elapsed time keeps us from doing restartpoints too often
-	 * while rapidly scanning large amounts of WAL.
+	 * If the checkpoint we saw in the wal was a shutdown checkpoint, it might
+	 * have been written after the recovery following a crash of the master.
+	 * In that case, the master will have completed any actions that were
+	 * incomplete when it crashed *during recovery*, and these completions
+	 * are therefore *not* logged in the wal.
+	 * To prevent getting out of sync, we follow what the master did, and
+	 * call the rm_cleanup() methods. To be on the safe side, we then create
+	 * a RestartPoint, regardless of the time elapsed. Note that asking
+	 * the resource managers if they have partial state would be redundant
+	 * after calling rm_cleanup().
 	 */
-	elapsed_secs = time(NULL) - ControlFile->time;
-	if (elapsed_secs < CheckPointTimeout / 2)
-		return;
+	if (shutdownCheckpoint) {
+		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+		{
+			if (RmgrTable[rmid].rm_cleanup != NULL)
+RmgrTable[rmid].rm_cleanup();
+		}
+	}
+	else {
+		/*
+		 * Do nothing if the elapsed time since the last restartpoint is less than
+		 * half of checkpoint_timeout.	(We use a value less than
+		 * checkpoint_timeout so that variations in the timing of checkpoints on
+		 * the master, or speed of transmission of WAL segments to a slave, won't
+		 * make the slave skip a restartpoint once it's synced with the master.)
+		 * Checking true elapsed time keeps us from doing restartpoints too often
+		 * while rapidly scanning large amounts of WAL.
+		 */
+		elapsed_secs = time(NULL) - ControlFile->time;
+		if (elapsed_secs < CheckPointTimeout / 2)
+			return;
 
-	/*
-	 * Is it safe to checkpoint?  We must ask each of the resource managers
-	 * whether they have any partial state information that might prevent a
-	 * correct restart from this point.  If so, we skip this opportunity, but
-	 * return at the next checkpoint record for another try.
-	 */
-	for (rmid = 0; rmid <= RM

Re: [HACKERS] [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync

2007-04-23 Thread Florian G. Pflug

Simon Riggs wrote:

On Thu, 2007-04-19 at 22:37 +0200, Florian G. Pflug wrote:

The problem is that after a crash, the master might complete incomplete
actions via rm_cleanup() - but since it won't wal-log those changes,
the slave won't know about this. This will at least prevent the creation
of any further restart points on the slave (because safe_restartpoint)
will never return true again - it it might even cause data corruption,
if subsequent wal records are interpreted wrongly by the slave because
it sees other data than the master did when it generated them.


I agree the problem exists. It is somewhat rare because the idea is that
if the Primary crashes we would failover to the Standby, which would
mean that both Primary and Standby have executed rm_cleanup(), if
needed.

So in the case where the Primary fails and we choose *not* to failover,
there is a potential difficulty on the Standby.


It occurred to me today that this might plague 8.1 too. As you explain below,
the problem is not really connected to restartpoints - failing to create
them is merely a symptom of this. On 8.1, this might still lead to rm_cleanup()
being called much "later" (if you consider the wal position to be the "time")
on the slave than on the master. I dunno if this really causes trouble - I
don't yet understand the btree code well enough to judge this.


The rationale for this fix could be described somewhat differently:

When we shutdown, we know for certain that safe_restartpoint() is true.
However, we don't know whether it is true because we successfully did a
clean shutdown, or because we crashed, recovered and then issued a
shutdown checkpoint following recovery. In the latter case we *must*
execute rm_cleanup() on the standby because it has been executed on the
primary. Not doing so at this point *might* be safe, but is not certain
to be safe. We don't *need* to log a restartpoint at this time, but it
seems sensible to do so.


While creating the patch, I've been thinking if it might be worthwhile
to note that we just did recovery in the ShutdownCheckpoint
(or create a new checkpoint type RecoveryCheckpoint). This would allow
for more error checking, because then the slave could check that
safe_restartpoint() is true for all ShutdownCheckpoints that were not
written after recovery.


We need to check that rm_cleanup() routines don't assume that they will
only ever be called once or this will clearly fail. There is also no
need to call rm_cleanup() unless rm_safe_restartpoint() is false.


But a non-idempotent rm_cleanup() routine will cause trouble anyway,
if postgres crashes after having called rm_cleanup() but before creating
the ShutdownCheckpoint.

greetings, Florian Pflug


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-30 Thread Florian G. Pflug

Martin Langhoff wrote:

So - if you are committed to providing your gateway long term to
Florian, I'm happy to drop my gateway in favour of yours.



(Florian, before basing your code on either you should get a checkout of
Aidan's and mine and check that the tips of the branches you are working
on match the cvs branches -- the cvsimport code is good but wherever
CVS is involved, there's a lot of interpretation at play, a sanity check
is always good).


Sorry for responding so late - I was rather busy during the last 1 1/2 weeks
with university stuff, and had only very little time to spend on SoC.

I've tried to switch my repo to both git mirrors, but there seems to be
something strange happening. The checkout pulls a _lot_ of objects
(a few hundred thousand), and then takes ages to unpack them all, bloating
my local repository (just rm-ing my local repo takes a few minutes after
the checkout).

It seems as if git pulls all revisions of all files during the pull -
which it shouldn't do as far as I understand things - it should only
pull those objects referenced by some head, no?

The interesting thing is that exactly the same problem occurs with
both of your mirrors...

Any ideas? Or is this just how things are supposed to work?

greetings, Florian Pflug


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Hacking on PostgreSQL via GIT

2007-04-30 Thread Florian G. Pflug

Aidan Van Dyk wrote:

* Florian G. Pflug <[EMAIL PROTECTED]> [070430 08:58]:
 

It seems as if git pulls all revisions of all files during the pull -
which it shouldn't do as far as I understand things - it should only
pull those objects referenced by some head, no?


Git pulls full history to a common ancestor on the clone/pull.   So the
first pull on a repo *will* necessarily pull in the full object history.
So unless you have a recent common ancestor, it will pull lots.  Note
that because git uses crypto hashes to identify objects, my conversion
and Martin's probably do not have a recent common ancestor (because my
header munging probably doesn't match Martin's exactly).

Ah, OK - that explains things.


The interesting thing is that exactly the same problem occurs with
both of your mirrors...

Any ideas? Or is this just how things are supposed to work?


Until you have a local repository of it, you'll need to go through the
full pull/clone.  If you're really not interested in history you can
"truncate" history with the --depth option to git clone.  That will give
you a "shallow repository", which you can use, develop, branch, etc in,
but won't give you all the history locally.

I'll retry with the "--depth" option - I'm doing development on my powerbook,
and OSX seems to cope badly with lots of little files - the initial unpacking
took hours - literally..


Also - what version of GIT are you using?  I *really* recommend using at
least 1.5 (1.5.2.X is current stable).  Please, do your self a favour,
and don't use 1.4.4.

I'm using 1.5.0  currently - it was the latest stable release when I began
to experiment with git.

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Heap page diagnostic functions

2007-05-01 Thread Florian G. Pflug

Zdenek Kotala wrote:
I did not find "forensics" in my translator; it is mentioned in the Oxford
dictionary, but the explanation is not clear to me. I agree with Bruce that it is
not a good name. What about a short form of diagnostic, "diag"?


Doesn't forensics basically mean to find the cause of something
*after* it happened, based on traces that the event left behind?
Like finding the culprit of a crime using, for example, the
fingerprints he left, or tracing the actions of an intruder
by analyzing logfiles or modified binaries?

In that case, it doesn't accurately describe those functions anyway
I think, because you call them from inside the database while it's
running, not from the outside after it was stopped or crashed.

Just the 2 eurocents of a non-native speaker...
Greetings, Florian Pflug



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Concurrently updating an updatable view

2007-05-14 Thread Florian G. Pflug

Richard Huxton wrote:

Richard Huxton wrote:

Heikki Linnakangas wrote:
The problem is that the new tuple version is checked only against the 
condition in the update rule, id=OLD.id, but not the condition in the 
original UPDATE clause, dt='a'.


Yeah, that's confusing :(.


Bit more than just normal rule confusion I'd say. Try the following 
two statements in parallel (assuming you've just run the previous):


UPDATE test SET dt='c';
UPDATE test SET dt='x' FROM test t2 WHERE test.id=t2.id AND t2.dt='b';

This isn't a problem with the view mechanism - it's a problem with 
re-checking clauses involving subqueries or joins I'd guess.


I'm trying to decide if it's unexpected or just plain wrong, and I 
think I'd have to argue wrong.


Or perhaps I'd not argue that :-/

Well, src/backend/executor/README agrees with you that it's wrong..

"Note a fundamental bogosity of this approach: if the relation containing
the original tuple is being used in a self-join, the other instance(s) of
the relation will be treated as still containing the original tuple, whereas
logical consistency would demand that the modified tuple appear in them too.
But we'd have to actually substitute the modified tuple for the original,
while still returning all the rest of the relation, to ensure consistent
answers.  Implementing this correctly is a task for future work."

This is really about MVCC in read committed mode, and the "just right 
for simpler cases":
http://www.postgresql.org/docs/8.2/interactive/transaction-iso.html#XACT-READ-COMMITTED 

Clearly there needs to be a change to the sentence: "Because of the 
above rule, it is possible for an updating command to see an 
inconsistent snapshot: it can see the effects of concurrent updating 
commands that affected the same rows it is trying to update"


Not true if there's a subquery/join involved.

If the cited part of the README is correct, then all joins and subqueries
are fine, except if they refer to the table being updated.

I think there should be a big, fat warning that self-referential
updates have highly non-obvious behaviour in read-committed mode,
and should be avoided.

greetings, Florian Pflug



---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Concurrently updating an updatable view

2007-05-15 Thread Florian G. Pflug

Richard Huxton wrote:

Hiroshi Inoue wrote:

Florian G. Pflug wrote:


I think there should be a big, fat warning that self-referential
updates have highly non-obvious behaviour in read-committed mode,
and should be avoided.


It seems pretty difficult for the PostgreSQL rule system to avoid such
kinds of updates. I'm doubtful whether UPDATABLE VIEWS can be implemented
using the rule system.


Remember this affects all self-referential joins on an UPDATE (and 
DELETE?) not just views. It's just that a rule is more likely to produce 
that type of query.


Is there consensus on what the correct behaviour should be for
self-referential updates in read-committed mode? Does the SQL Spec
have anything to say about this?

greetings, Florian Pflug


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Not ready for 8.3

2007-05-19 Thread Florian G. Pflug

Andrew Dunstan wrote:
What would making a branch actually do for you? The only advantage I can 
see is that it will give you a way of checkpointing your files. As I 
remarked upthread, I occasionally use RCS for that. But mostly I don't 
actually bother. I don't see how you can do it reasonably off a local 
cvs mirror - rsync will just blow away any changes you have checked in 
next time you sync with the master.


I don't think we can make CVS behave like a distributed SCM system, and 
ability to create local branches seems to me one of the fundamental 
points of such systems. If that's what the demand is for, then we should 
look again at moving to something like Mercurial.


I think the great thing about DVCS systems is that not everybody
necessarily needs to use the *same* system. And it doesn't really
matter what the central repository runs on - there are gateways
from/to nearly everything available...

I currently use GIT for my SoC project, and it works quite well -
I can create an arbitrary number of local branches, and syncing
the currently active branch with CVS is achieved by just doing
"cg-update pgsql-head".

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: Do we need a TODO? (was Re: [HACKERS] Concurrently updating anupdatable view)

2007-06-01 Thread Florian G. Pflug

Simon Riggs wrote:

On Mon, 2007-05-28 at 19:56 -0400, Bruce Momjian wrote:

Added to TODO:

* Fix self-referential UPDATEs seeing inconsistent row versions in
  read-committed mode

  http://archives.postgresql.org/pgsql-hackers/2007-05/msg00507.php



I'm sorry guys but I don't agree this is a TODO item.


Maybe the suggested TODO has too narrow a focus, but I think that
*something* has to be done about this.


IMHO this follows documented behaviour, even if y'all are shocked.

Yes, but documented != sensible && documented != intuitive &&
documented != logical.


If you don't want the example cases to fail you can
- use SERIALIZABLE mode to throw an error if inconsistency is detected
- use SELECT FOR SHARE to lock the rows in the subselect
e.g.

UPDATE foo
SET pkcol = 'x'
WHERE pkcol IN
      (SELECT pkcol
       FROM foo
       FOR SHARE);

In the case of concurrent UPDATEs the second UPDATE will normally
perform the subSELECT then hang waiting to perform the UPDATE. If you
use FOR SHARE the query will hang on the subSELECT (i.e. slightly
earlier), which makes the second query return zero rows, as some of you
were expecting.


Sure, but with a similar argument you could question the whole
update-in-read-committed-mode logic. After all, you wouldn't need
that logic if you always obtained a share lock on the rows to be updated
*before* you started updating them.


Maybe we need a way of specifying that the non-UPDATE relation should be
locked FOR SHARE in a self-referencing UPDATE? Though that syntax could
seem to look pretty weird from here, so I'd say cover this situation in
a code example and be done.

Also, methinks we should have agreed behaviour before we make something
a TODO item. That would help us uncover this type of thing in more
detail, or at least force TODO to read "investigate whether ...".


Ack. That's why I initially asked if there was consensus on what the
correct behaviour is.

greetings, Florian Pflug


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] recovery_target_xid & crashes on the master

2007-06-04 Thread Florian G. Pflug

Hi

I'm currently working on splitting StartupXLOG into smaller
parts, because I need to reuse some of them for concurrent
wal recovery (for my GSoC project).

The function recoveryStopsHere in xlog.c checks if we should
stop recovery due to the values of recovery_target_xid and
recovery_target_time. For recovery_target_xid, we stop if
we see a commit or abort record for the given xid.

Now I wonder what happens if an (admittedly rather confused) DBA
uses an xid of a transaction that was aborted because of a
crash of the master as recovery_target_xid. The way I read the
code, postgres will just recover until it reaches the end of
the xlog in that case because neither a COMMIT nor an ABORT
for that xid exists in the WAL.

I'm not sure if this is worth fixing - it seems like a rather
contrived corner case - but I thought I'd bring it up...

greetings, Florian Pflug



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-06 Thread Florian G. Pflug

Hi

This is a description of the GSoC work I've done so far on readonly
support for PITR slaves. I'm looking for any kind of comments
on this - I want to make sure that I work in a direction that
the community approves.

Work done so far:
-
.) Added a new GUC operational_mode, which can be set to either
   readwrite or readonly. If it is set to readwrite (the default),
   postgres behaves as usual. All the following changes are only
   in effect if operational_mode is set to readonly.
.) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR)
   if postgres is not in readwrite mode. This macro protects the
   following functions to make sure that no writes occur in
   readonly mode.
     SimpleLruWritePage, SLruPhysicalWritePage
     EndPrepare, FinishPreparedTransaction
     XLogInsert, XLogWrite, ShutdownXLog
     CreateCheckpoint
     MarkBufferDirty.
.) All transactions are set to readonly mode (an implicit
   SET TRANSACTION READONLY), and are not allowed to do
   SET TRANSACTION READWRITE.
.) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay
   is started, and it takes over the role that bgwriter plays in the
   shutdown process.
.) Transactions are assigned a dummy xid ReadOnlyTransactionId, that
   is considered to be "later" than any other xid.
.) A global ReadOnlySnapshot is maintained in shared memory. This is
   copied into backend local memory by GetReadonlySnapshotData (which
   replaces GetSnapshotData in readonly mode).
.) Crash recovery is not performed in readonly mode - instead, postgres
   PANICs, and tells the DBA to restart in readwrite mode. Archive
   recovery of course *will* be allowed, but I'm not that far yet.

Open Problems:
--
.) Protecting MarkBufferDirty with ASSUME_OPMODE_READWRITE is troublesome,
   because callers usually call MarkBufferDirty from within a critical
   section, and thus elog(ERROR) is turned into elog(PANIC). This e.g.
   happens with my patch if you call nextval() in readonly mode.
   Does anyone see a better solution than adding checks into
   all callers that are not otherwise protected from being called
   in readonly mode?
.) Since the slave needs to track a Snapshot in shared memory, it cannot
   resize that snapshot to accommodate however many concurrent transactions
   might have been running on the master. My current plan is to detect if
   that global snapshot overflows, and to lock out readonly queries on the
   slave (and therefore remove the need of keeping the snapshot current)
   until the number of active xids on the master has dropped below
   max_connections on the slave. A warning will be written to the postgres
   log that suggests that the DBA increase the max_connections value on
   the slave (see the sketch below).
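
For illustration, a rough sketch of the overflow handling I have in mind
(the structure and field names below are just placeholders, not code from
my patch):

static void
PublishSnapshotToSharedMemory(SharedSnapshotData *shared,
                              TransactionId *running_xids, int nrunning)
{
	if (nrunning > shared->max_xip)
	{
		/*
		 * Too many concurrent transactions on the master: stop updating
		 * the shared snapshot and warn the DBA. Readonly queries keep
		 * using the last published snapshot until we fit again.
		 */
		if (!shared->overflowed)
			elog(WARNING, "xip array overflow - increase max_connections on the slave");
		shared->overflowed = true;
		return;
	}

	/* the snapshot fits: publish the current list of running xids */
	memcpy(shared->xip, running_xids, nrunning * sizeof(TransactionId));
	shared->xcnt = nrunning;
	shared->overflowed = false;
}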

Please tell me what you think about this approach, and especially if you
see any problems that I overlooked.

greetings, Florian Pflug


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-06 Thread Florian G. Pflug

Matthew T. O'Connor wrote:

Florian G. Pflug wrote:

Work done so far:
-
 .) Don't start autovacuum and bgwriter. 


Do table stats used by the planner get replicated on a PITR slave?  I 
assume so, but if not, you would need autovac to do analyzes.


Yes - everything that gets wal-logged on the master gets replicated to
the slave. In my design, it isn't possible to do "analyze" on the slave,
because all datafiles are strictly readonly (well, with the small
exception of hint-bit updates actually).

greetings, Florian Pflug



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-06 Thread Florian G. Pflug

Jeff Davis wrote:

On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote:

   .) Since the slave needs to track a Snapshot in shared memory, it cannot
  resize that snapshot to accommodate however many concurrent transactions
  might have been running on the master. My current plan is to detect if
  that global snapshot overflows, and to lock out readonly queries on the
  slave (and therefore remove the need of keeping the snapshot current)
  until the number of active xids on the master has dropped below
  max_connections on the slave. A warning will be written to the postgres
  log that suggests that the DBA increase the max_connections value on
  the slave.


If we did lock the slave while waiting for transactions to complete on
the master, we'd need to document some stronger warnings against idle
transactions so that administrators could notice and correct the
problem.


It's not exactly locking until transactions complete on the master; it's locking
the slave until we reach a position in the wal on the slave with fewer
than max_connections concurrent transactions. But yes, I agree, this
will need to be documented.


Are you referring to the size of the xip array being a problem? Would it
help to tie the size of the xip array to max_connections? I understand
that max_connections might be greater on the master, but maybe something
similar? 


That's what I currently do - the xip array on the slave is sized to
hold max_connections entries (Actually, it's max_connections +
max_prepared_xacts I think). The problem occurs exactly if those
values are set too small on the slave - and since shared mem
objects are not resizeable, I don't see how the slave can handle
an xip overflow gracefully other than by not publishing the
information in shared memory as long as it doesn't fit there.

On further thought - maybe locking out transactions isn't even
necessary - they would just continue to see the old global snapshot,
so time wouldn't advance for them until the number of concurrent
transactions decreases again.

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-07 Thread Florian G. Pflug

Simon Riggs wrote:

On Wed, 2007-06-06 at 16:11 +0200, Florian G. Pflug wrote:


.) Added a new GUC operational_mode, which can be set to either
readwrite or readonly. If it is set to readwrite (the default),
postgres behaves as usual. All the following changes are only
in effect if operational_mode is set to readonly.


Do we need this? We are already InArchiveRecovery.

If I understand you correctly, you suggest that readonly queries
are always allowed during archive recovery - so upon startup
postgres steps through these states:
  .) Initial recovery (until we reach a consistent state)
  .) Allow readonly queries
  .) Finish recovery in the background (might mean recovering "forever"
     on a PITR slave)
  .) Allow readwrite queries

My plan was to have a global switch, which lets you choose between
  .) All queries are readonly (until the next postmaster restart at least),
     but you get background replay
  .) No background replay, but once replay is done, readwrite queries
     can be executed (just what PG does now).

The main reason why I invented that global switch operational_mode was
to remove to need to switch between readonly mode and readwrite mode
on the fly.


.) Created a macro ASSUME_OPMODE_READWRITE that does elog(ERROR)
if postgres is not in readwrite mode. This macro protects the
following functions to make sure that no writes occur in
readonly mode.
  SimpleLruWritePage, SLruPhysicalWritePage
  EndPrepare, FinishPreparedTransaction
  XLogInsert, XLogWrite, ShutdownXLog
  CreateCheckpoint
  MarkBufferDirty.


These are Asserts?

The macro ASSUME_OPMODE_READWRITE just does
if (!OperationalModeReadWrite)
  elog(ERROR, ...)
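
Spelled out, the macro is essentially the following (the exact message text
here is just an example, not necessarily what my patch uses):

#define ASSUME_OPMODE_READWRITE \
	do { \
		if (!OperationalModeReadWrite) \
			elog(ERROR, "operation not allowed in readonly mode"); \
	} while (0)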


  .) Don't start autovacuum and bgwriter. Instead of bgwriter, bgreplay
 is started, and it takes over the role that bgwriter plays in the
 shutdown process.


Autovacuum -> understood.

What does bgreplay do? Why not just start bgwriter earlier and disable
some of its other functionality while InRecovery?

See above - it seemed simpler to clearly separate the two.


  .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that
 is considered to be "later" than any other xid.


So you are bumping FirstNormalTransactionId up by one for this?

In fact I changed MaxTransactionId to 0xFFFFFFFE, and set
ReadOnlyTransactionId to 0xFFFFFFFF. Additionally, I changed
IsNormalTransactionId to test not only for >= FirstNormalTransactionId,
but also for <= MaxTransactionId.
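
In code, the changed definitions look roughly like this (a sketch modelled
on the existing definitions in access/transam.h, not a verbatim copy of my
patch):

#define FirstNormalTransactionId	((TransactionId) 3)
#define MaxTransactionId		((TransactionId) 0xFFFFFFFE)
#define ReadOnlyTransactionId		((TransactionId) 0xFFFFFFFF)

#define IsNormalTransactionId(xid) \
	((xid) >= FirstNormalTransactionId && (xid) <= MaxTransactionId)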


You're assuming then that we will "freeze" replay while we run a query?

No. My plan is to first get to a point where replay freezes while
queries are running, and to then figure out a more intelligent way to do this.
I already have a few ideas for how to do this, but I want to complete the
"simple version" before I start with that work.



Otherwise doing this will mean the snapshot changes as a query executes.

Why? It's only the xid of the transaction, not its xmin and xmax, that is
set to ReadOnlyTransactionId.


  .) A global ReadOnlySnapshot is maintained in shared memory. This is
 copied into backend local memory by GetReadonlySnapshotData (which
 replaces GetSnapshotData in readonly mode).
   .) Crash recovery is not performed in readonly mode - instead, postgres
  PANICs, and tells the DBA to restart in readwrite mode. Archive
  recovery of course *will* be allowed, but I'm not that far yet.


This is the very heart of the matter. This isn't just a technical issue,
it goes to the heart of the use case for this feature. Can we recover
while running queries?

Yes. My comment only applies to crash recovery - i.e., recovery that happens
*without* a recovery.conf present, after a crash.
It only really matters if you do the following:
  .) Start pg in readwrite mode.
  .) Kill it / It crashes
  .) Restart in readonly mode.

The main difference between crash recovery and recovery from a filesystem-level
backup is the additional information that the backup label gives us in the
second case - more specifically, the minRecoveryLoc that we read from the
backup label. Only with that knowledge is "recovering until we reach
a consistent state" a well-defined operation. And readonly queries
can only be executed *after* we did this "minimal recovery". So if there is
crash recovery to be done, the best we could do is to recover, and then start
in readonly mode. If this is *really* what the DBA wants, he can just start
in readwrite mode first, then cleanly shut PG down, and restart in readonly
mode.

> If not, how much time will we spend in replay
> mode v query mode? Will we be able to run long running queries *and*
> maintain a reasonable time to recover? Is this a mechanism for providing
> HA and additional query capacity, or is it just a mechanism for
> additional query capacity only? Those are open questions to which I
> don't have any answer

Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-07 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Florian G. Pflug wrote:

Jeff Davis wrote:

Are you referring to the size of the xip array being a problem? Would it
help to tie the size of the xip array to max_connections? I understand
that max_connections might be greater on the master, but maybe something
similar? 


That's what I currently do - the xip array on the slave is sized to
hold max_connections entries (Actually, it's max_connections +
max_prepared_xacts I think). The problem occurs exactly if those
values are set too small on the slave - and since shared mem
objects are not resizeable, I don't see how the slave can handle
an xip overflow gracefully other than by not publishing the
information in shared memory as long as it doesn't fit there.


You could store the value of max_connections in the checkpoint xlog 
record, and read it from there in the slave. Though one could still 
change it on the master and restart without restarting the slave as well.


But AFAIK shmem allocation happens before recovery starts... Even if this
was solved, it would only be a partial solution since, as you note, the
master might be restarted while the slave keeps running. So I think it's
better not to add too much complexity, and just tell the DBA to increase
max_connections on the slave, together with a comment in the documentation
never to set max_connections smaller on the slave than on the master.

greetings, Florian Pflug


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] [RFC] GSoC Work on readonly queries done so far

2007-06-07 Thread Florian G. Pflug

Jeff Davis wrote:

On Wed, 2007-06-06 at 22:36 +0100, Simon Riggs wrote:

  .) Transactions are assigned a dummy xid ReadOnlyTransactionId, that
 is considered to be "later" than any other xid.

So you are bumping FirstNormalTransactionId up by one for this?

You're assuming then that we will "freeze" replay while we run a query?
Otherwise doing this will mean the snapshot changes as a query executes.


Is it possible to put a normal xmax for the snapshot?

It wouldn't be a real transaction on the slave, and also the master will
use that ID for a real transaction itself. However, I don't see a real
problem on the slave because it would only be used for the purpose of
the snapshot we need at that moment.


My plan is the following:
.) Initially, queries and recovery will run interleaved, but not concurrently.
   For that, an "empty" snapshot is sufficient, with
   xmin=xid=xmax=ReadOnlyTransactionId (see the sketch below).
.) Then, I'll work on running them concurrently. The replay process will publish
   a "current" snapshot in shared memory, using "real" xmin and xmax values
   it generates by maintaining a list of currently active (as in: running when
   the wal was written on the master) transactions. In that case, only xid
   is set to ReadOnlyTransactionId.
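
To illustrate the first point, filling in such an "empty" snapshot would look
roughly like the sketch below (the function name is made up, and fields beyond
the ones shown are ignored):

static void
FillEmptyReadOnlySnapshot(Snapshot snapshot)
{
	/*
	 * ReadOnlyTransactionId is treated as "later" than any normal xid,
	 * and the xip list is empty, so every transaction that committed
	 * before the snapshot was published is considered visible.
	 */
	snapshot->xmin = ReadOnlyTransactionId;
	snapshot->xmax = ReadOnlyTransactionId;
	snapshot->xcnt = 0;
	snapshot->xip = NULL;
	snapshot->curcid = FirstCommandId;
}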

greetings, Florian Pflug


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Is this a feature?

2007-06-11 Thread Florian G. Pflug

Joshua D. Drake wrote:

Take the following:

INFO:  analyzing "pg_catalog.pg_authid"
INFO:  "pg_authid": scanned 1 of 1 pages, containing 5 live rows and 0 
dead rows; 5 rows in sample, 5 estimated total rows


The above is completely redundant. Why not just say:

INFO:  "pg_authid": scanned 1 of 1 pages, containing 5 live rows and 0 
dead rows; 5 rows in sample, 5 estimated total rows


If the first line is meant to be an indicator, then make the above line 
do this:


INFO: analyzing "pg_catalog.pg_authid" :

Don't add a new line, and when the next step of information comes up 
append it to the existing line to get:


INFO: analyzing "pg_catalog.pg_authid": scanned 1 of 1 pages, containing 
5 live rows and 0 dead rows; 5 rows in sample, 5 estimated total rows


But then the line could only be pushed to the client *after* the analysis
of the table has finished, while with the current output you know what
postgres is currently doing, because you get "analyzing ..." *before*
the operation starts.

greetings, Florian Pflug



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Selecting a constant question

2007-06-12 Thread Florian G. Pflug

Dann Corbit wrote:

-Original Message-
From: Hannu Krosing [mailto:[EMAIL PROTECTED]

Since libpq function PQfsize returns -2 for all constant character
strings in SQL statements ... What is the proper procedure to determine
the length of a constant character column after query execution but
before fetching the first row of data?

Why not just get the first row and determine the width from it before
you actually use any of tha data ?


What if the second row is 1000x longer?


That's exactly the point. Consider
select mytext from mytable ;

How can PostgreSQL possibly know the maximum length
of the returned values *before* it has scanned the
whole table?

greetings, Florian Pflug


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] EXPLAIN omits schema?

2007-06-13 Thread Florian G. Pflug

Gregory Stark wrote:

"Joshua D. Drake" <[EMAIL PROTECTED]> writes:

I agree. XML seems like a fairly natural fit for this. Just as people should
not try to shoehorn everything into XML, neither should they try to shoehorn
everything into a relational format either.

Now all we need is an XML schema for it ;-)

Well I am not a big fan of XML but it certainly seems applicable in this
case.


I'm not a fan either so perhaps I'm biased, but this seems like a good example
of where it would be an *awful* idea.

Once you have an XML plan what can you do with it? All you can do is parse it
into constituent bits and display it. You cant do any sort of comparison
between plans, aggregate results, search for plans matching constraints, etc.

How would I, with XML output, do something like:

SELECT distinct node.relation 
  FROM plan_table 
 WHERE node.expected_rows < node.actual_rows*2;


or

SELECT node.type, average(node.ms/node.cost)
  FROM plan_table 
 GROUP BY node.type;


I believe that XQuery actually supports such queries. So if postgres
supported XQuery (or does it already? I honestly don't know), writing
such a query wouldn't be that hard I think. The execution probably
won't be super-efficient, but for query plans that seems OK.

greetings, Florian Pflug

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Controlling Load Distributed Checkpoints

2007-06-13 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Jim C. Nasby wrote:

On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:

Heikki Linnakangas <[EMAIL PROTECTED]> writes:
Thinking about this whole idea a bit more, it occured to me that the 
current approach to write all, then fsync all is really a historical 
artifact of the fact that we used to use the system-wide sync call 
instead of fsyncs to flush the pages to disk. That might not be the 
best way to do things in the new load-distributed-checkpoint world.

How about interleaving the writes with the fsyncs?

I don't think it's a historical artifact at all: it's a valid reflection
of the fact that we don't know enough about disk layout to do low-level
I/O scheduling.  Issuing more fsyncs than necessary will do little
except guarantee a less-than-optimal scheduling of the writes.


If we extended relations by more than 8k at a time, we would know a lot
more about disk layout, at least on filesystems with a decent amount of
free space.


I doubt it makes that much difference. If there was a significant amount 
of fragmentation, we'd hear more complaints about seq scan performance.


OTOH, extending a relation that uses N pages by something like
min(ceil(N/1024), 1024) pages might help some filesystems to
avoid fragmentation, and hardly introduce any waste (about 0.1%
in the worst case). So if it's not too hard to do, it might
be worthwhile, even if it turns out that most filesystems deal
well with the current allocation pattern.
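
In code, the growth rule I have in mind is simply the following (illustration
only):

/*
 * Number of pages to extend a relation by, given its current size in
 * pages: roughly 0.1% of the current size, capped at 1024 pages
 * (8 MB with 8 kB blocks).
 */
static int
extension_size(int npages)
{
	int		extend_by = (npages + 1023) / 1024;		/* ceil(npages / 1024) */

	return (extend_by < 1024) ? extend_by : 1024;
}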

greetings, Florian Pflug

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Worries about delayed-commit semantics

2007-06-22 Thread Florian G. Pflug

PFC wrote:

On Fri, 22 Jun 2007 16:43:00 +0200, Bruce Momjian <[EMAIL PROTECTED]> wrote:

Simon Riggs wrote:

On Fri, 2007-06-22 at 14:29 +0100, Gregory Stark wrote:
> "Joshua D. Drake" <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> >> untrustworthy disk hardware, for instance.  I'd much rather use names
> >> derived from "deferred commit" or "delayed commit" or some such.
> >
> > Honestly, I prefer these names as well as it seems directly related versus
> > transaction guarantee which sounds to be more like us saying, if we turn it 
off
> > our transactions are bogus.

That was the intention..., but name change accepted.

> Hm, another possibility: "synchronous_commit = off"

Ooo, I like that. Any other takers?


Yea, I like that too but I am now realizing that we are not really
deferring or delaying the "COMMIT" command but rather the recovery of
the commit.  GUC as full_commit_recovery?


commit_waits_for_fsync =

    force_yes: makes all commits "hard"
    yes:       commits are "hard" unless specified otherwise [default]
    no:        commits are "soft" unless specified otherwise [should replace fsync=off use case]
    force_no:  makes all commits "soft" (controller with write cache "emulator")


I think you got the last line backwards - without the fsync() after
a commit, you can't be sure that the data made it into the controller
cache. To be safe you *always* need the fsync() - but it will probably
be much cheaper if your controller doesn't have to actually write to
the disks, but can cache in battery-backed ram instead. Therefore,
if you own such a controller, you probably don't need deferred commits.

BTW, I like synchronous_commit too - but maybe asynchronous_commit
would be even better, with inverted semantics of course.
Then you'd have "asynchronous_commit = off" as the default.


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Worries about delayed-commit semantics

2007-06-22 Thread Florian G. Pflug

Richard Huxton wrote:

Bruce Momjian wrote:

Tom Lane wrote:

What's wrong with synchronous_commit?  It's accurate and simple.


That is fine too.


My concern would be that it can be read two ways:
1. When you commit, sync (something or other - unspecified)
2. Synchronise commits (to each other? to something else?)*

It's obvious to people on the -hackers list what we're talking about, 
but is it so clear to a newbie, perhaps non-English speaker?


* I can see people thinking this means something like "commit_delay".


OTOH, the concept of synchronous vs. asynchronous (function) calls
should be pretty well-known among database programmers and administrators.
And (at least to me), this is really what this is about - the commit
happens asynchronously, at the convenience of the database, and not
the instant that I requested it.

greetings, Florian Pflug


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Still recommending daily vacuum...

2007-07-05 Thread Florian G. Pflug

Michael Paesold wrote:

Alvaro Herrera wrote:

So what you are proposing above amounts to setting scale factor = 0.05.
The threshold is unimportant -- in the case of a big table it matters
not if it's 0 or 1000, it will be almost irrelevant in calculations.  In
the case of small tables, then the table will be vacuumed in almost
every iteration if the threshold is 0, which is fine because the table
is small anyway.  So why not let the threshold be 0 and be done with it?


For very small tables, setting a threshold of 0 could mean a vacuum 
after every single row update (or every other row). I think that is just 
burning cycles. What about a threshold of 10 or 50, to have at least 
some sanity limit? Even though the cost of vacuum of a small table is 
low, it is still not free, IMHO, no?


A bit off-topic (because probably not realistic in an 8.3 timeframe) -
but maybe the threshold should be specified in terms of "expected number of
pages to be freed", instead of specifying a bias for the number of modified
rows as is done now. Then "1" would probably be a reasonable default, because
a vacuum that won't free at least one page seems to be not really worth
the effort - it won't save any future IO bandwidth.

Just an idea I got while following this thread...

greetings, Florian Pflug

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync

2007-07-05 Thread Florian G. Pflug

Tom Lane wrote:

[ back to dealing with this patch, finally ]

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

While creating the patch, I've been thinking if it might be worthwhile
to note that we just did recovery in the ShutdownCheckpoint
(or create a new checkpoint type RecoveryCheckpoint). This would allow
for more error checking, because then the slave could check that
safe_restartpoint() is true for all ShutdownCheckpoints that were not
written after recovery.


I concur that this is a good idea --- we should have a third checkpoint
record type that shows that a crash recovery occurred.  However, we can
probably only do that for 8.3 and beyond.  If we try to do it in
existing release branches then there's likelihood of trouble due to WAL
incompatibility between master and standby.  While we do advise people
to update their standbys first, I don't think it's worth risking such
problems just to add some more error checking.

>

Conclusion: we should apply Florian's patch as-is in 8.2, do something
morally equivalent in 8.1 and before, and invent a
CrashRecoveryCheckpoint record type in HEAD.


Sounds good.

Do you want me to code up such patches for 8.1 and 8.3 in the next days,
or is someone else already working on it?

greetings, Florian Pflug



---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] [PATCH] A crash and subsequent recovery of themaster can cause the slave to get out-of-sync

2007-07-05 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

Conclusion: we should apply Florian's patch as-is in 8.2, do something
morally equivalent in 8.1 and before, and invent a
CrashRecoveryCheckpoint record type in HEAD.



Sounds good.


Actually, now that I look closer, this patch seems completely wrong.
It's predicated on an assumption that rm_cleanup won't write WAL entries
describing what it did ... but, at least in the btree case, it does.
(I think gist/gin might not, but that's a bug in those AMs not in xlog.)
I'm therefore wondering what test case led you to think there was
something wrong.


It wasn't a testcase - I was trying to understand the xlog code while working
on my concurrent walreplay patch, and wondered what happens if the master 
crashes and then recovers while the slave keeps running.


I've re-read my original email to Simon, and it seems that I believed
that rm_cleanup methods wouldn't be able to write to the xlog because they are
called during recovery.

But StartupXLOG *does* make the wal appendable *before* the rm_cleanup methods
are called.

So I now think (at least for btree) that everything is fine, and that I was
just being stupid.

Sorry for the noise, guys
greetings, Florian Pflug


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] ReadRecord, EndRecPtr and XLOG_SWITCH

2007-07-09 Thread Florian G. Pflug

Hi

When ReadRecord encounters an XLOG_SWITCH record, it does
EndRecPtr.xrecoff += XLogSegSize - 1;
EndRecPtr.xrecoff -= EndRecPtr.xrecoff % XLogSegSize;
which seems to set xrecoff to either 0 (if it was 0)
or to XLogSegSize (if it was > 0).

Note that xrecoff == XLogSegSize is kind of "denormalized" -
the normalized version would have xrecoff == 0, and xlogid = xlogid+1

Passing this "denormalized" EndRecPtr to ReadRecord again
to read the next record then triggers a PANIC
("invalid record offset at ??/1000"). Passing NULL to ReadRecord
to read the next record works, because it takes care to align the
EndRecPtr to the next page, thereby fixing the "denormalization".

Is there a reason not to do the same for non-NULL arguments to
ReadRecord? Or is there some failure case that the current
behaviour protects against?

The reason I stumbled over this is that I want to restart archive
recovery from a "bgreplay" process - I tried passing the EndRecPtr
via shmem, and using it as my initial argument to ReadRecord, and
thereby stumbled over this behaviour.

greetings, Florian Pflug


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] ReadRecord, EndRecPtr and XLOG_SWITCH

2007-07-09 Thread Florian G. Pflug

Florian G. Pflug wrote:


Please disregard - I was confusing xlogid with xlog segments, so
most of my mail was nonsense.

I've fixed my problem by storing not the EndRecPtr, but rather
the ReadRecPtr, in shmem and rereading the last already applied
record in my bgreplay process. Then I can just use ReadRecord(NULL),
and things Just Work.

Sorry for the noise & greetings
Florian Pflug


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] "Running readonly queried on PITR slaves" status update

2007-07-09 Thread Florian G. Pflug

Hi

After struggling with understanding xlog.c and friends far enough to
be able to refactor StartupXLOG to suit the needs of concurrent recovery,
I think I've finally reached a workable (but still a bit hacky) solution.

My design is centered around the idea of a bgreplay process that
takes over the role of the bgwriter in readonly mode, and continuously
replays WALs as they arrive. But since recovery during startup is
still necessary (We need to bring a filesystem-level backup into a
consistent state - past minRecoveryLoc - before allowing connections),
this means doing recovery in two steps, from two different processes.

I've changed StartupXLOG to only recover up to minRecoveryLoc in readonly
mode, and to skip all steps that are not required if no writes to
the database will be done later (Especially creating a checkpoint at
the end of recovery). Instead, it posts the pointer to the last recovered
xlog record to shared memory.

bgreplay then uses that pointer for an initial call to ReadRecord to
set up WAL reading for the bgreplay process. Afterwards, it repeatedly
calls ReplayXLOG (a new function), which always replays at least
one record (if there is one, otherwise it returns false), until
it reaches a safe restart point.
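
The idea of the bgreplay main loop, in rough pseudo-C (not the actual code -
error handling and signal handling are left out, and got_shutdown_request is
just a stand-in for however the shutdown request ends up being detected):

	for (;;)
	{
		/*
		 * ReplayXLOG replays at least one record if one is available and
		 * keeps going until it reaches a safe restart point; it returns
		 * false if no new WAL could be read.
		 */
		if (!ReplayXLOG())
			pg_usleep(1000000L);	/* nothing to replay yet - wait a second */

		if (got_shutdown_request)
			break;
	}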

Currently, in my test setup, I can start a slave in readonly mode and
it will do initial recovery, bring postgres online, and continuously
recover from inside bgreplay. There isn't yet any locking between
wal replay and queries.

I'll add that locking during the next few days, which should result
in a very early prototype. The next steps will then be finding a way
to flush backend caches after replaying records that modified system
tables, and (related) finding a way to deal with the flatfiles.

I'd appreciate any comments on this, especially those pointing
out problems that I overlooked.

greetings, Florian Pflug


---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] 2PC-induced lockup

2007-07-11 Thread Florian G. Pflug

Tom Lane wrote:

"Simon Riggs" <[EMAIL PROTECTED]> writes:

I'd be much more comfortable if LOCK TABLE caused a message to the log
if it is executed on any system table.


Enabled by "set training_wheels = on", perhaps?

This is really pretty silly to be getting worked up about.  The command
in question wouldn't have been allowed at all except to a superuser,
and there are plenty of ways to catastrophically destroy your database
when you are superuser; most of which we will never consider blocking
for the same reasons that Unix systems have never tried to block root
from doing "rm -rf /".  I'd say the real design flaw in Peter's
referenced application is that they're running it as superuser.


Yeah.. though "lock pg_auth; prepare" looks quite innocent, much more
than say "delete from pg_database" or "rm -rf whatever".
At least to the untrained eye.

I fully agree that special-casing this particular way to shoot yourself
in the foot is not worth it - but maybe pursuing a more general solution
would be worthwhile? Maybe superuser connections could e.g. ignore
any errors that occur while reading a system table, together with
a big, fat warning, but still allow a logon? That of course depends
on the assumption that basic authentication is possible using just
the information from the flatfiles and pg_hba.conf, which I'm not
sure about.

greetings, Florian Pflug


---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] xlog switch & last record before the switch

2007-07-12 Thread Florian G. Pflug

Hi

To test my PITR-slave readonly-query patch, I continuously do
insert into test ...
pg_switch_xlog()
sleep 1
on the master, and let the slave process the generated xlogs

The log output on the slave looks the following (unnecessary lines remove)

LOG:  restored log file "0001016E" from archive
LOG:  REDO @ 0/16E00020; LSN 0/16E0004C: prev 0/16D00098; xid 1346: Transaction - commit: 2007-07-12 15:43:00.686056+02
LOG:  REDO @ 0/16E0004C; LSN 0/16E0008C: prev 0/16E00020; xid 0: XLOG - checkpoint: redo 0/16E00020; tli 1; xid 0/1347; oid 24576; multi 1; offset 0; online
LOG:  REDO @ 0/16E0008C; LSN 0/16E00140: prev 0/16E0004C; xid 1350: Sequence - log: rel 1663/1/16384
LOG:  REDO @ 0/16E00140; LSN 0/16E00B88: prev 0/16E0008C; xid 1350; bkpb1: Heap - insert: rel 1663/1/16386; tid 2/49
LOG:  REDO @ 0/16E00B88; LSN 0/16F0: prev 0/16E00140; xid 1350: XLOG - xlog switch

LOG:  restored log file "0001016F" from archive
LOG:  REDO @ 0/16F00020; LSN 0/16F0004C: prev 0/16E00B88; xid 1350: Transaction - commit: 2007-07-12 15:43:02.159717+02
LOG:  REDO @ 0/16F0004C; LSN 0/16F00098: prev 0/16F00020; xid 1352: Heap - insert: rel 1663/1/16386; tid 2/50
LOG:  REDO @ 0/16F00098; LSN 0/1700: prev 0/16F0004C; xid 1352: XLOG - xlog switch

LOG:  restored log file "00010170" from archive
LOG:  REDO @ 0/1720; LSN 0/174C: prev 0/16F00098; xid 1352: Transaction - commit: 2007-07-12 15:43:02.26456+02
LOG:  REDO @ 0/174C; LSN 0/1798: prev 0/1720; xid 1356: Heap - insert: rel 1663/1/16386; tid 2/51
LOG:  REDO @ 0/1798; LSN 0/1710: prev 0/174C; xid 1356: XLOG - xlog switch

As you can see, the COMMIT records seem to end up being logged *after*
the xlog switch. I would have expected the order
"heap-insert, commit, switch, heap-insert, commit, switch, ...", not
"heap-insert, switch, commit, heap-insert, switch, commit, ...".

Is this the expected behaviour, or just an artefact of the implementation
of xlog switches?

greetings, Florian Pflug


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] xlog switch & last record before the switch

2007-07-12 Thread Florian G. Pflug

Simon Riggs wrote:

On Thu, 2007-07-12 at 16:17 +0200, Florian G. Pflug wrote:


To test my PITR-slave readonly-query patch, I continuously do
insert into test ...
pg_switch_xlog()
sleep 1
on the master, and let the slave process the generated xlogs

The log output on the slave looks the following (unnecessary lines remove)

>>
>> >


As you can see, the COMMIT records seem to end up being logged *after*
the xlog switch. I would have expected the order
"heap-insert, commit, switch, heap-insert, commit, switch, ...", not
"heap-insert, switch, commit, heap-insert, switch, commit, ...".

Is this the expected behaviour, or just an artefact of the implementation
of xlog switches?


Can you show the exact SQL executed? 


If you do INSERT ... ;select pg_switch_xlog() then the COMMIT will occur
after the switch. If you do:
INSERT ...;
select pg_switch_xlog();
then the COMMIT will occur before the switch.


Ah, you caught me - I had forgotten that psql -c "..."
executes the whole string in one transaction.

I was just going to suggest that pg_start_backup, pg_stop_backup
and pg_switch_xlog emit a warning or even an error if called
from within a larger transaction, because that's quite certainly
not what the user wants. But since those are just plain functions,
I guess checking for that might not be trivial...


If it were otherwise this would imply statements were executed prior to
the previous commit, which I hope and pray is never the case.

You can relax, the bug was located in front of the screen :-)

greetings, Florian Pflug

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


[HACKERS] Full page images in WAL & Cache Invalidation

2007-07-22 Thread Florian G. Pflug

Hi

I'm currently working on correctly flushing the
catalog/relation/smgr caches on a readonly PITR
slave during recovery. These are the things that
currently cause me headaches.

1) It seems that the btree code sends out relcache
   invalidation msgs during normal operation
   (No DDL statements are executed). This lets any
   simple flush-all-caches-if-ddl-was-executed scheme
   fail.

2) When a full page image is written to the wal, the
   information about what tuple was updated is lost.
   So synthesizing cache invalidation msgs from the
   WAL records would need to reverse-engineer a full
   page image, which seems hard and error-prone.

3) Most cache invalidations seem to be generated by
   heap_insert (via PrepareForTupleInvalidation). Those
   seem to be reconstructable from the WAL quite easily.
   Those sent out via CacheInvalidateRelcache*, however,
   seem to leave no trace in the WAL.

What I'm wondering is how much performance is lost if
I just let the slave flush all its caches whenever it
replayed a commit record of a transaction that executed
DDL. To me it looks like that would only seriously harm
performance if a lot of temporary tables are created on
the master. Since there seem to be quite a few people who are
unhappy about the current temptable implementation,
optimizing for that case might prove worthless if 8.4 or
8.5 changes the way that temptables are handled.

If this brute-force approach turns out to perform really
badly, does anyone see an elegant way around (2) and (3)?
(2) seems solvable by writing logical and physical records
to the wal - similar to what that xlog compression idea
needs (I, however, lost track of what came out of that
discussion). But (3) seems to be messy..

greetings, Florian Pflug




Re: [HACKERS] Full page images in WAL & Cache Invalidation

2007-07-22 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

I'm currently working on correctly flushing the
catalog/relation/smgr caches on a readonly PITR
slave during recovery.


I don't believe there is any workable solution to that short of logging
cache-flush operations in WAL.


I still don't fully understand which problems, if any, are caused by overly
aggressive cache flushing - what puzzles me is that DDL statements seem
to be considered something rare on a production system by most people on
this list, yet the caches seem to be highly optimized to avoid
unnecessary invalidations.


Leaving aside the btree issues, are you worried about performance
problems, or can aggressive cache flushing hurt correctness?

The reason that I dislike WAL-logging of the flush operations so much is
that, since people are concerned about the amount of WAL traffic
postgres generates, such a solution would introduce yet another GUC.

And to make this reasonable foolproof, the slave would need a way to
detect if that GUC is set correctly on the master. All in all, that
seems to be quite hackish...

greetings, Florian Pflug



Re: [HACKERS] Full page images in WAL & Cache Invalidation

2007-07-22 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Leaving aside the btree issues, are you worried about performance
problems, or can aggressive cache flushing hurt correctness?


It shouldn't hurt correctness, but I don't think you entirely grasp
the magnitude of the performance hit you'll take.  The last time I
tried running the regression tests with CLOBBER_CACHE_ALWAYS enabled,
they took about one hundred times longer than normal.  Now you are
evidently hoping not to disable the caches entirely, but I don't
understand how you are going to handle the various random
CacheInvalidateRelcache calls that are here and there in the system
(not only in btree).


I must be missing something... A quick grep for CacheInvalidateRelcache
turned up these places:
src/backend/rewrite/rewriteDefine.c (EnableDisableRule)
src/backend/rewrite/rewriteSupport.c (SetRelationRuleStatus)
src/backend/access/nbtree/nbtinsert.c (_bt_insertonpg, _bt_newroot)
src/backend/access/nbtree/nbtpage.c (_bt_getroot, _bt_pagedel)
src/backend/commands/trigger.c (renametrig, EnableDisableTrigger)
src/backend/commands/cluster.c (mark_index_clustered)
src/backend/commands/indexcmds.c (DefineIndex)
src/backend/commands/tablecmds.c (setRelhassubclassInRelation)
src/backend/commands/vacuum.c (vac_update_relstats)
src/backend/catalog/heap.c (SetRelationNumChecks)
src/backend/catalog/index.c (index_drop, index_update_stats)

For CacheInvalidateHeapTuple, there is an additional hit in
src/backend/commands/vacuum.c (move_chain_tuple, move_plain_tuple).
Note that move_chain_tuple and move_plain_tuple are only called
in repair_frag, which in turn is only used in full_vacuum_rel.

Now, to me all of these with the exception of the btree functions,
vac_update_relstats and move_chain/plain_tuple look like they are only
called during DDL statements.

My basic assumption is that DDL is something quite uncommon on a
production system. This is obviously *totally* *wrong* for the
regression tests, and I don't doubt that my scheme will show quite
bad performance if you use that as a benchmark. But if you, say,
use pgbench for testing, than the picture will be quite different
I imagine.

My strategy would be the following
1) Mark the commit record if a transaction generated any invalidation
   events apart from the btree ones. The only other source of inval
   events seems to be "VACUUM FULL" on a system relation, which won't
   happen on a modestly well-tuned system I think - any VACUUM FULL
   will need special treatment anyway.
2) At replay time, the caches are flushed after that record was
   replayed (a rough sketch of this follows below).

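To make the replay side of that strategy concrete, here is a minimal standalone
sketch. The names (XLogCommitRecord, FlushAllCaches, ReplayCommitRecord) are made
up for illustration; the real commit record and cache machinery obviously look
different.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Hypothetical, stripped-down commit record: the xid plus one flag
 * telling the slave whether the transaction generated inval events. */
typedef struct XLogCommitRecord
{
    TransactionId xid;
    bool          did_invalidation;   /* set on the master at commit time */
} XLogCommitRecord;

/* Stand-in for flushing the slave's catalog/relation/smgr caches. */
static void FlushAllCaches(void)
{
    printf("flushing all caches\n");
}

/* Replay side: apply the commit, then flush everything if the
 * committing transaction executed DDL. */
static void ReplayCommitRecord(const XLogCommitRecord *rec)
{
    printf("replayed commit of xid %u\n", rec->xid);
    if (rec->did_invalidation)
        FlushAllCaches();
}

int main(void)
{
    XLogCommitRecord plain = { 100, false };   /* ordinary DML commit  */
    XLogCommitRecord ddl   = { 101, true  };   /* commit that ran DDL  */

    ReplayCommitRecord(&plain);   /* no cache flush      */
    ReplayCommitRecord(&ddl);     /* flushes every cache */
    return 0;
}
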
greetings, Florian Pflug



Re: [HACKERS] Full page images in WAL & Cache Invalidation

2007-07-22 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

My basic assumption is that DDL is something quite uncommon on a
production system.


I'm not sure I believe that, because of temp tables.  There's also
the problem that plain VACUUM (or ANALYZE) causes a relcache flush
to update the relation-size statistics.


The real problem with the scheme you propose is that it turns a
cache flush on one table into a system-wide cache flush.


Yes.. It really builds on the idea that those flushes don't happen
too frequently.


We might be able to do something about the temp-table case upstream:
AFAICS there's no reason for backends to broadcast cache flushes for
their own temp tables to other backends.  But that's just a sketch
of a thought at the moment.


I was actually hoping that some day temptables wouldn't be stored in
pg_class and friends at all. I was wondering if it
wouldn't be possible to keep the information about them solely in
the catcache and relcache (making the word "cache" a big lie). I didn't
check if that is feasible at all, though - just an idea I got at
one point.


Anyway, if you believe that DDL is infrequent, why are you resistant
to the idea of WAL-logging cache flushes?

For multiple reasons.

First, cache invalidations are not the only problem caused by replaying
system-table updates. The whole SnapshotNow
business doesn't exactly make things easier either. So it feels like a
lot of added complexity and code for little gain - unless a *lot*
more things (like locking requests) are logged too.

Second, I'm sure that people would insist on a GUC to turn logging
those records off if they don't need them in their setup. Similar to
that make_wal_compressable GUC that was proposed a few weeks ago.
And if it's a GUC, the slave should be able to verify that it was
set correctly on the master, otherwise this becomes a huge footgun.

Third, I try to keep the changes necessary on the master at a
minimum - I feel that this will make merging the code at some point
easier, because the risk of breaking something is smaller. Bugs
in the slave code will maybe cause crashes and wrong results, but
at least they won't cause data corruption on the master.

And last but not least, I have only limited time for this project -
so I try to find the simplest workable solution, and maybe tune things
later when practical experience shows where the real bottlenecks are.

greetings, Florian Pflug



Re: [HACKERS] Full page images in WAL & Cache Invalidation

2007-07-23 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Anyway, if you believe that DDL is infrequent, why are you resistant
to the idea of WAL-logging cache flushes?


First, cache invalidations are not the only problem caused by replaying 
system-table updates. The whole SnapshotNow

business doesn't exactly make things easier too. So it feels like a
lot of added complexity and code for little gain - unless a *lot*
more things (like locking requests) are logged too.


The mention of locking requests brought to mind the following
gedankenexperiment:

1. slave server backend is running some long-running query on table X.

2. WAL-reading process receives and executes DROP TABLE X.

(It doesn't even have to be a DROP; most varieties of ALTER are enough
to create problems for a concurrently-running query.)

It's really hard to see how to defend against that without a fairly
complete simulation of locking on the slave side.


Well, locking on the slave is a bit easier than on the master, for
two reasons
1) Queries running on the slave only lock in  AccessShareLock mode -
   therefore, you only need to know if a certain mode conflicts with
   AccessShareLock - and there seems to be only one that does, namely
   AccessExclusiveLock. So we really only need to care about
   AccessExclusiveLock locks on the master
2) As far as I can see, the point of an AccessExclusiveLock is *not*
   actually preventing queries from running while a DDL statement is
   *executed*, but rather preventing queries from running while the
   statement is *committed*. This fits the fact that system tables are
   read using SnapshotNow (not SnapshotDirty) - while the DDL is
   running, everybody is happily using the old information, the trouble
   would only start after the commit because with SnapshotNow you
   suddenly see the new state.

I'm not yet 100% sure that (2) holds (with the exception of VACUUM FULL) -
but I'm fairly confident, because if (2) were wrong, then how would the
system survive a crash during the execution of a DDL statement?

So after a bit more thought (And reading. Thanks for all the replies,
guys! They are greatly appreciated.),
I came up with the following plan for both inval events and locks
.) Store two flags in the commit record of a transaction, for
   "transaction generated inval events" and "transaction held an
   access exclusive lock".
.) Upon replay, block until no transactions are running (for
   "transaction held an exclusive lock") before replaying the
   record, or flush the caches after replaying it (for
   "transaction generated inval events").

This scheme has two really nice properties:
First, it can be extended fairly easily to not store a simple flag, but
a list of OIDs, and use that to make the flushing and blocking more
fine-grained - without changing the basic way in which things work.

And second, it stores all information needed in the *commit* record.
That removes problems with transactions that are aborted due to a
crash, where the WAL therefore doesn't tell that they were aborted.

VACUUM FULL will need some special care - but at least VACUUM FULL is
already such a disruptive operation, that it probably won't surprise
anyone if it's disruptive on the slave too. (And now that CLUSTER is
MVCC-Safe from what I read, the usecase for VACUUM FULL seems to
be pretty slim).

The btree metapage caching will need special treatement too - probably
some flags in the WAL record that change the metapage that instruct the
slave to synthesize a suitable inval event.

What stays open is regular VACUUM (and maybe inline vacuuming - is that
part of the latest version of HOT, or did it get removed?). Here, the
key is logging the return value of GetOldestXmin() I think. Since that
value is what decides in the end if a tuple can be killed or not, having
it available on the slave should allow the slave to block replay until
no slave query depends on the tuple anymore.

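As a rough illustration of that last point - made-up helper names, nothing more
than a sketch of the intended behaviour - replay of a tuple-killing record would
be held back until every slave snapshot is at least as new as the GetOldestXmin()
value logged on the master:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Toy stand-in for the xmins of snapshots currently held by read-only
 * queries on the slave (0 = free slot). */
static TransactionId slaveSnapshotXmins[] = { 95, 102, 0, 0 };
#define NSLOTS (sizeof(slaveSnapshotXmins) / sizeof(slaveSnapshotXmins[0]))

/* Smallest xmin of any active slave snapshot, or 0 if none. */
static TransactionId SlaveOldestSnapshotXmin(void)
{
    TransactionId oldest = 0;
    for (unsigned i = 0; i < NSLOTS; i++)
        if (slaveSnapshotXmins[i] != 0 &&
            (oldest == 0 || slaveSnapshotXmins[i] < oldest))
            oldest = slaveSnapshotXmins[i];
    return oldest;
}

/* Toy stand-in for "one query finished": drop the oldest snapshot. */
static void WaitForSomeQueryToFinish(void)
{
    TransactionId oldest = SlaveOldestSnapshotXmin();
    for (unsigned i = 0; i < NSLOTS; i++)
        if (slaveSnapshotXmins[i] == oldest)
        {
            slaveSnapshotXmins[i] = 0;
            return;
        }
}

/* Block replay of a record that removes tuples considered dead on the
 * master because of masterOldestXmin (the logged GetOldestXmin() value),
 * until no slave query could still need those tuples. */
static void WaitUntilReplaySafe(TransactionId masterOldestXmin)
{
    TransactionId slaveXmin;
    while ((slaveXmin = SlaveOldestSnapshotXmin()) != 0 &&
           slaveXmin < masterOldestXmin)
    {
        printf("replay blocked: slave xmin %u < master xmin %u\n",
               slaveXmin, masterOldestXmin);
        WaitForSomeQueryToFinish();
    }
}

int main(void)
{
    WaitUntilReplaySafe(100);     /* waits out the query with xmin 95 */
    printf("safe to replay\n");
    return 0;
}
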
greetings, Florian Pflug





Re: [HACKERS] Full page images in WAL & Cache Invalidation

2007-07-23 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

I came up with the following plan for both inval events and locks
.) Store two flags in the commit record of a transaction, for
"transaction generated inval events" and "transaction held an
access exlusive lock".
.) Upon replay, block until no transactions are running (for
"transaction held an exclusive lock") before replaying the
record, or flush the caches after replaying it (for
"transaction generated inval events").


This does not work; the lock has to be taken earlier than that.
(See for instance VACUUM's truncate calls.)  Not to mention that
you have converted "exclusive lock on one table" to "exclusive lock
on every table", which is even worse than the idea of converting
per-table cache flushes to system-wide ones.


I'll check what VACUUM is doing.. I primarily had CLUSTER and TRUNCATE
in mind.

That "exclusive lock on one table becomes exclusive lock on all tables"
issue can (as I wrote in the part of my mail that you snipped) be
solved I think by storing a list of OIDs instead of a flag for the
locks and inval events.

greetings, Florian Pflug



Re: [HACKERS] Full page images in WAL & Cache Invalidation

2007-07-23 Thread Florian G. Pflug

Simon Riggs wrote:

On Sun, 2007-07-22 at 19:58 +0200, Florian G. Pflug wrote:

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

I'm currently working on correctly flushing the
catalog/relation/smgr caches on a readonly PITR
slave during recovery.

I don't believe there is any workable solution to that short of logging
cache-flush operations in WAL.



The reason that I dislike WAL-logging of the flush operations so much is
that, since people are concerned about the amount of WAL traffic
postgres generates, such a solution would introduce yet another GUC.

And to make this reasonable foolproof, the slave would need a way to
detect if that GUC is set correctly on the master. All in all, that
seems to be quite hackish...


Seems like we should WAL log flush operations first. It's fairly
straightforward to do that and we can then measure its effect on the
primary easily enough. Your other suggestions seem much more complex.

I think we have a reasonable tolerance for increases in WAL and as you
said earlier, we may balance that out with other optimisations. Or we
may find a more efficient way of doing it later.

Let's aim to get that first query running, then go back and tune it
later.


I've so far added an LWLock that makes replay and queries mutually
exclusive. Simple test cases seem to work, but I haven't really
beaten the system yet...

Of course, my current version falls over as soon as you do
DDL on the master - working on fixing that, and on
subsequently removing that lock again :-)

greetings, Florian Pflug



[HACKERS] Tracking a snapshot on PITR slaves

2007-08-02 Thread Florian G. Pflug

Hi

Since my attempts to find a simple solution for the read-only query
locking problems (one that doesn't need full WAL logging of lock
requests) haven't been successful yet, I've decided to turn to the
problems of tracking a snapshot on the slaves for now. (Because first
such a snapshot is needed for any kind of concurrent recovery anyway, and
second because any non-simplistic solution of the locking problems
will quite likely benefit from such a snapshot).

The idea is to create a special kind of snapshot that works basically
like a MVCC snapshot, but with the meaning of the xip array inverted.
Usually, if a xid is *not* in the xip array of a snapshot, and greater
than the xmin of that snapshot, the clog state of the xid determines
tuple visibility. This is not well suited for queries running during
replay, because the effects of a xlog record with a (to the slave)
previously unknown xid shouldn't be visible to concurrently running
queries.

Therefore, a flag xip_inverted will be added to SnapshotData that causes
HeapTupleSatisfiesMVCC to assume that any xid >= xmin and *not* in the
xip array is in progress.

This allows the following to work:
.) Store RecentXmin with every xlog record, in a new field xl_xmin.
   (Wouldn't be needed in *every* record, but for now keeping it
   directly inside XLogRecord make things easier, and it's just 4 bytes)
.) Maintain a global snapshot template in shmem during replay, with the xmin
   being the highest xmin seen so far in any xlog record. That template
   is copied whenever a readonly query needs to obtain a snapshot.
.) Upon replaying a COMMIT or COMMIT_PREPARED record, the xid of the
   committed transaction is added to the global snapshot,
   making the commit visible to all further copies of that snapshot.

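A standalone sketch of the inverted visibility test, modelled directly on the
description above (this is not the real HeapTupleSatisfiesMVCC, and
TransactionIdDidCommit is stubbed out):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

typedef struct SnapshotData
{
    TransactionId  xmin;          /* xids below this follow the normal clog rules */
    TransactionId *xip;           /* inverted: xids whose commit was replayed     */
    int            xcnt;
    bool           xip_inverted;  /* proposed flag: invert the meaning of xip     */
} SnapshotData;

/* Stub for the real clog lookup. */
static bool TransactionIdDidCommit(TransactionId xid)
{
    (void) xid;
    return true;
}

static bool XidInArray(TransactionId xid, const TransactionId *xip, int xcnt)
{
    for (int i = 0; i < xcnt; i++)
        if (xip[i] == xid)
            return true;
    return false;
}

/* Is a tuple written by "xid" visible under this snapshot?  With
 * xip_inverted set, any xid >= xmin that is NOT in xip counts as still
 * in progress - i.e. invisible - regardless of what the clog says. */
static bool XidVisibleInSnapshot(TransactionId xid, const SnapshotData *snap)
{
    if (xid < snap->xmin)
        return TransactionIdDidCommit(xid);

    if (snap->xip_inverted)
        return XidInArray(xid, snap->xip, snap->xcnt) &&
               TransactionIdDidCommit(xid);

    /* normal MVCC rule: being in xip means "in progress", hence invisible */
    return !XidInArray(xid, snap->xip, snap->xcnt) &&
           TransactionIdDidCommit(xid);
}

int main(void)
{
    TransactionId replayedCommits[] = { 101 };
    SnapshotData  snap = { 100, replayedCommits, 1, true };

    /* 101's commit was replayed -> visible; 102 is unknown -> invisible */
    printf("101 visible: %d, 102 visible: %d\n",
           XidVisibleInSnapshot(101, &snap),
           XidVisibleInSnapshot(102, &snap));
    return 0;
}
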
If you can shoot this down, you're welcome to do so ;-)

greetings, Florian Pflug




Re: [HACKERS] XID wraparound and busy databases

2007-08-16 Thread Florian G. Pflug

Tom Lane wrote:

Bruce Momjian <[EMAIL PROTECTED]> writes:

Is enlarging the xid field something we should consider for 8.4?


No.  We just got the tuple header down to 24 bytes, we are not going
to give that back and then some.

If you are processing 6K transactions per second, you can afford to
vacuum every couple days... and probably need to vacuum much more often
than that anyway, to avoid table bloat.

Possibly your respondent should think about trying to do more than one
thing per transaction?


I'm wondering how many of those 6k xacts/second are actually modifying
data. If a large percentage of those are readonly queries, then the need
for vacuuming could be reduced if postgres assigned an xid only if that
xid really hits the disk. Otherwise (for purely select-type queries) it
could use some special xid value.

This is what I'm doing in my Readonly-Queries-On-PITR-Slave patch.

greetings, Florian Pflug



Re: [HACKERS] SQL feature requests

2007-08-22 Thread Florian G. Pflug

Ben Tilly wrote:

On 8/22/07, Gregory Stark <[EMAIL PROTECTED]> wrote:

"Ben Tilly" <[EMAIL PROTECTED]> writes:

2. Why is 'non-integer constant in GROUP BY' an error?

Hm... I was a bit surprised by this warning myself. IIRC there was an
implementation convenience issue.


If your implementation accepts:

  group by case when true then 'foo' end

how much harder can it be to accept:

  group by 'foo'


This is not about the difficulty of the implementation, but rather about
non-confusing behaviour, I think.

AFAIK, "group by 1" means "group by the first selected column", not
"group all rows together". But "group by 'foo'" would carry the second
meaning - "group all rows together". This is so totally counter-intuitive,
that it's not even funny...

"group by case when true then 'foo' end" looks different enough compared to
"group by 1" to make this less of a footgun.

Seems that the "group by " syntax predates the appearance of
aliases in the SQL standard...

greetings, Florian Pflug



Re: [HACKERS] SQL feature requests

2007-08-23 Thread Florian G. Pflug

Ben Tilly wrote:

On 8/22/07, Michael Glaesemann <[EMAIL PROTECTED]> wrote:

On Aug 22, 2007, at 20:49 , Ben Tilly wrote:


If your implementation accepts:

  group by case when true then 'foo' end

What would that mean? Regardless of whether or not it's accepted, it
should have *some* meaning.


To my eyes it has a very clear meaning, we're grouping on an
expression that happens to be a constant.  Which happens to be the
same for all rows.  Which is a spectacularly useless thing to actually
do, but the ability to do it happens to be convenient when I'm looking
for something to terminate a series of commas in a dynamically built
query.


Which is the same very clear meaning that "group by 1" has - we're
grouping on an expression which happens to be the constant 1. Hey,
wait a second. This isn't what "group by 1" means at all - it
rather means group by whatever the first column in the select list is.

So, yes, "group by 'foo'" *seems* to have a very clear meaning - but
that clearness vanishes as soon as you take into account what "group by 1"
means.

greetings, Florian Pflug



Re: [HACKERS] SQL feature requests

2007-08-23 Thread Florian G. Pflug

Kevin Grittner wrote:

On Thu, Aug 23, 2007 at  3:01 PM, in message <[EMAIL PROTECTED]>,
Tom Lane <[EMAIL PROTECTED]> wrote: 

The only argument I've
heard that carries much weight with me is that it eases porting from
other DBMS's that allow this.  Are there any others besides Oracle?
 
select * from (select f1 from t) 
 
In Sybase:
 
com.sybase.jdbc2.jdbc.SybSQLException: The derived table expression is missing a correlation name. Check derived table syntax in the Reference Manual.

Error code: 11753
SQL state: Z


The really funny thing is that pgsql, mysql and at least sybase
*explicitly* disallow the no-alias case. Which shows that
  .) This seems to be a common source of confusion and errors.
  .) Alias-less subqueries wouldn't lead to ambiguous grammars in those databases.
 Otherwise, you'd expect to get some more generic syntax error, and not
 the very explicit "No alias, but expected one".

I agree with Tom - knowing *why* the standard committee disallows that syntax -
and why everybody except oracle chose to agree with it would be quite 
interesting.

greetings, Florian Pflug




Re: [HACKERS] simple replication

2007-08-24 Thread Florian G. Pflug

Edoardo wrote:

On Aug 24, 4:27 pm, [EMAIL PROTECTED] ("Heikki Linnakangas")
wrote:

Edoardo wrote:

- Google Summer of Code effort by Florian Pflug
I would like to talk about the last one: I couldn't find any website /
explanation or further post.
The google link
http://code.google.com/soc/postgres/appinfo.html?csaid=6545828A8197EBC6
is broken
does anybody knows something about it?

He's working on allowing you to perform read-only queries on a warm
standby server. Currently, the warm standby can't be used for queries
until you fail over to it.


Even like this is worth considering this solution in my scenario, do
you think I can have a look at his work?


The archive of the pgsql-hackers list contains some discussion of my work.
I can also send you the latest version - though it's absolutely not ready
for production use at this point. It's more of a prototype, to find out where
the problems lie, and how to overcome them.

Work on this won't stop with the end of GSoC, and I'm not the only one
interested in making this happen - so there is a realistic chance that
something like this will show up in postgres sooner or later - no promises,
though...

greetings, Florian Pflug




[HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions

2007-08-25 Thread Florian G. Pflug

I'm resending this without that actual patch attached, since my first
mail didn't get through. The patch can be found here:
http://soc.phlo.org/lazyxidassign.patch

Hi

Lately, there was some interest in assigning XIDs for toplevel Transactions
only when the transaction actually needs one (that is, it writes something
to the disk), like we currently do for subtransactions. This is obviously
also a crucial part of my work on read-only transactions on PITR slaves -
since there is no way to assign "real" XIDs on a PITR slave.

I've spent the last few days factoring out that work, and turning it into
a general solution. The result is this patch, which basically does the following

.) It defines a special TemporaryTransactionId that is used as an xact's xid
until the xact calls GetCurrentTransactionId / GetTopTransactionId.
.) It introduces a new macro TransactionIdIsPermanent, which tells if an
XID is valid, and not equal to TemporaryTransactionId.
.) It lets GetTopTransactionId assign an XID on-demand, similar to
how GetCurrentTransactionId handles that for subtransactions.
.) Each transaction get an "rid" (ResourceOwnerId) assigned when it starts, and
obtains a lock on that rid, similar to how the xid is locked. This can
be used to wait for a transaction's toplevel resource owner to release all
    its locks, and serves as a unique identifier for a running transaction.
This is needed for concurrent index builds to wait until all transactions
holding a ShareLock on the target relation have ended.

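A toy model of the first three points - the names are taken from the description
above, everything else is simplified to the point of caricature (the real patch
works on the transaction state machinery, of course):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId    ((TransactionId) 0)
#define TemporaryTransactionId  ((TransactionId) 1)   /* placeholder xid */

#define TransactionIdIsValid(xid)      ((xid) != InvalidTransactionId)
#define TransactionIdIsPermanent(xid)  (TransactionIdIsValid(xid) && \
                                        (xid) != TemporaryTransactionId)

static TransactionId nextXid = 100;   /* toy stand-in for GetNewTransactionId() */
static TransactionId topXid  = TemporaryTransactionId;

static void StartTransaction(void)
{
    topXid = TemporaryTransactionId;  /* no real xid until somebody needs one */
}

/* Assigns a real xid on first use, like the subxact code already does. */
static TransactionId GetTopTransactionId(void)
{
    if (!TransactionIdIsPermanent(topXid))
        topXid = nextXid++;
    return topXid;
}

int main(void)
{
    StartTransaction();
    printf("read-only so far, permanent xid? %d\n",
           TransactionIdIsPermanent(topXid));          /* 0 */
    printf("first write forces xid %u\n", GetTopTransactionId());
    printf("after the write, permanent xid? %d\n",
           TransactionIdIsPermanent(topXid));          /* 1 */
    return 0;
}
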
The patch passes the regression test, though there are currently two issues
that need to be resolved.
1) The second waiting phase of concurrent index builds fails to wait for xacts
that haven't been assigned an XID when the reference snapshot was taken.
The "rid" doesn't help here, because it's not currently stored in the
snapshot.
2) I'm not entirely sure yet how to handle the two flags MyXactMadeTempRelUpdates and
    MyXactMadeXLogEntry, and the MyLastRecPtr variable. Those seem to be made
partly redundant by this patch - checking if an xact has a
permanent xid assigned already tells if the transaction made any writes.

(1) could be easily solved by querying the list of currently active RIDs after
taking the reference snapshot. But since AFAIK HOT introduces a new method for
guaranteeing that a transaction won't use an index that might not contain
all tuples that xact is interested in, I wanted to get feedback on how HOT
currently handles this.

It's probably not the best time to come up with new patches, since everybody
seems to be busy working on getting 8.3 out. But this patch is a quite natural
fallout of my work on read-only queries on PITR slaves, and I'd be very
interested to know if the general direction this patch takes is deemed
acceptable.

greetings, Florian Pflug




Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions

2007-08-25 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

I've spent the last few days factoring out that work, and turning it into
a general solution. The result is this patch, which basically does the following



.) It defines a special TemporaryTransactionId that is used as an xact's xid
 until the xact calls GetCurrentTransactionId / GetTopTransactionId.


[ squint... ]  Why do you need that?  The behavior we use with
subtransactions is just to leave the XID as InvalidOid until there's
a reason to assign it, and I think we should do the same with top-level
XIDs.  Having an extra TemporaryTransactionId seems ugly, mainly because
it's not clear how XID comparison should handle it.


I invented the TemporaryTransactionId because it seemed worthwhile to
distinguish between backends which do not currently run a transaction
(xid == InvalidTransactionId), and those which run a transaction that
doesn't yet have an xid assigned (xid == TemporaryTransactionId). Currently,
the TemporaryTransactionId is treated as being later than any other xid
value. I'm not wedded to those TemporaryTransactionIds though - they just
seemed like a good idea when I started with my readonly-queries on PITR-slaves
work, and it allows for a few more assertions.


To leave XID at 0, you will need to teach GetSnapshotData and maybe
some other places that a proc could be advertising nonzero xmin even
when its XID is still 0.  This does not seem like a big problem though.


Yeah - TemporaryTransactionId removed the need for a few special cases
in that area - but at the cost of having to distinguish between
TransactionIdIsValid and TransactionIdIsPermanent (meaning valid && !temporary).


.) Each transaction get an "rid" (ResourceOwnerId) assigned when it starts, and
 obtains a lock on that rid, similar to how the xid is locked. This can
 be used to wait for a transaction's toplevel resource owner to release all
 it's locks, and serves as a unique identifier for a running transaction.


This seems like inventing a concept we could do without, also overhead
we could do without (assigning globally unique RIDs would require as
much mechanism and contention as XID assignment does).  What about
locking the backend's PID instead?  I do not see that long-term
uniqueness is needed, we just want to be able to wait for the backend's
current transaction to end.

If you do think it's important to distinguish the other guy's current
and next transaction (which maybe it is), then possibly we could lock a
combination of the PID and a *local*, per-backend transaction counter
(there should be plenty of room in LOCKTAG for this).  This counter
value would have to be advertised in PGPROC, but there wouldn't be any
contention involved to assign a new value.


I wanted some (single) value that would fit into some standard C datatype.
Since I guess using "long long" in postgres code code is a bad idea
(It's not supported on all 32-bit plattforms I think), I wanted to come
up with some 32-bit identifier. If the PID were guaranteed to be 16-bit
we could use the other 16 bits as a counter - but all modern Unixen have
moved past a limit of 65535 processes I fear...

While writing this mail I realized that my RID generation algorithm -
while being quite lightweight I think - has a small race condition.
The algorithm is
for (;;) {
  /* not atomic: two backends can read and increment the same value */
  rid = ShmemVariableCache->nextRid++;

  /* keep the rid only if it is valid and we managed to lock it */
  if (ResourceOwnerIdIsValid(rid) && ResourceOwnerLockTableInsert(rid))
    break;
}

I just realized that if two backends manage to obtain the same rid, and
one is then paged out long enough for the other to lock the rid, run
its transaction and commit, then the second backend will get the same
rid :-(. So it's back to the drawing board anyway...


It's slightly annoying that this scheme involves taking two lmgr locks
per read-write transaction.  I wonder whether we couldn't dispense with
the notion of locking one's XID per se.  This would mean that where we
try to wait for another transaction by XID, we have to trawl through
the ProcArray to find that XID and see what PID/localID it maps to;
but if we're in that path we're already going to be blocking, so more
cycles there might be a good tradeoff for fewer cycles in transaction
start.


Yeah - I do not really like that dual-locking thing either. But it makes
prepared transaction handling much easier - if we were to only lock the
RID, we'd have to store the rid<->xid mapping for prepared transactions
somewhere *and* guarantee that we won't assign that RID to another transaction -
even after a server restart...


1) The second waiting phase of concurrent index builds fails to wait for xacts
 that haven't been assigned an XID when the reference snapshot was taken.
 The "rid" doesn't help here, because it's not currently stored in the snapshot.

Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions

2007-08-27 Thread Florian G. Pflug

Tom Lane wrote:

I wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Yeah - I do not really like that dual-locking thing either. But it makes
prepared transaction handling much easier - if we were to only lock the
RID, we'd have to store the rid<->xid mapping for prepared transactions



Hmmm  that's a good point.  Not sure how to include prepared xacts
in the scheme.


After further thought I think we should avoid depending on PIDs as part
of a lock tag --- their behavior is too system-dependent, in particular
we have no right to assume that when a backend exits the same PID won't
be reassigned shortly after, possibly leading to confusion.

Instead, I suggest that we keep a session counter in shared memory and
have each backend assign itself a session ID at startup using that.
A 32-bit session ID in combination with a 32-bit locally assigned
transaction number should be sufficiently unique to identify a
transaction (prepared or otherwise) for the purposes of locking.
These "transient XIDs" only need to be unique as long as the transaction
exists plus shortly thereafter (to avoid race conditions when someone
waits for a transaction that actually terminated a moment before).
So wraparound of the counters isn't a problem, although we probably
want to reserve zero as an invalid value.

I think we only need a transient XID of this sort for top-level
transactions, not subtransactions.


Sounds good, if we decide to go with the transient XID idea. See below
for an alternative that I just came up with.


To make CREATE INDEX CONCURRENTLY work, we'd need two things:

* GetLockConflicts would need to report the transient XIDs of the
  conflicting xacts, not regular XIDs, since they might not have regular
  XIDs.  Then we'd wait on those locks instead of regular-XID locks.


Yes. This is exactly what my patch does today.


* The second phase where we wait out transactions that can still see
  old tuples doesn't work because such transactions won't necessarily
  be listed in the snapshot.  Instead, what we have to do is look
  through the ProcArray for transactions whose advertised xmin is
  less than the xmax of our reference snapshot.  When we find one,
  wait for it using its transient XID.



AFAICT, C.I.C. is currently the only place in the system where we
really need transient XIDs at all.  Everyplace else that we need to
wait for a transaction, it's because we found its regular XID in
a tuple we want to lock or modify.  So the whole thing is a bit
annoying.  Maybe we could get rid of the extra overhead with some
shenanigans inside the lock manager, like not bothering to create
a data structure representing the holding of a transient-XID lock
until such time as C.I.C. actually tries to wait for it.  But
again, that seems like a second-pass optimization.


I've given some thought to that. There are two distinct things we
need to be able to wait for

1) Until all current holders of a lock whose grant mask conflicts with
a given lock level have dropped their lock

2) Until all currently in-use snapshots have an xmin larger than
some given value (The xmax of the reference snapshot).

(1) Could be solved directly in the lock manager. We'd need some
mechanism to wake up a process whenever someone releases a
certain lock.

(2) Could be done by acquiring a ShareLock (with a new locktype
LOCKTYPE_XMIN) on the xmin of a transaction's serializable
snapshot when it's created.
The second waiting phase of concurrent index builds would then be
  a) Find the oldest xmin in the ProcArray.
  b) If that xmin is equal to or greater than the xmax of our
 reference snapshot, we're done.
  c) Wait until the ExclusiveLock (for LOCKTYPE_TRANSACTION)
 is released on that xmin. After that point, new transactions
 will compute an xmin greater than the oldest one we found
     in the ProcArray, because the limiting transaction has
 exited, and because ReadNewTransactionId returns a value
 greater than that xmin too (Otherwise, we'd have exited in (b)).
  d) Wait for all current holders of LOCKTYPE_XMIN to release
 their locks. (Using the machinery needed for (1)). No
     new holders can show up, because new snapshots will compute
 a larger xmin.
  e) Goto a).

I could code (2), but I'd need help with (1) - The details of the locking
subsystems are still somewhat a mystery to me.

greetings, Florian Pflug




Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions

2007-08-27 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Sounds good, if we decide to go with the transient XID idea. So below
for an alternative that I just came up with.


This proposal appears to require taking and releasing a brand-new lock
type every time a snapshot is made or destroyed.  That is certainly not
going to be less overhead than the transient-XID scheme.  At least in
READ COMMITTED mode, there are normally multiple snapshots taken per
transaction.


Only for the serializable snapshot, I'd have thought, making the overhead
bearable (it is surely the oldest of all the xact's snapshots), but ...


(Something worth noting here is that I expect soon, probably 8.4,
we will fix things so that what a backend advertises in MyProc->xmin
is the xmin of its oldest still-live snapshot.  That means that xmin
will change intra-transaction in READ COMMITTED mode, and thus that
we would indeed need to take and release the sort of lock you are
suggesting each time.)


... with this in mind, my proposal looks pretty bad :-(

What do you think about solving the requirements of the *first* waiting
phase (Where we wait for current ShareLock holders) inside the lock manager?
The only real downside I can see is that I feel uneasy about messing with
that code... It seems to be subtle, and quick to anger ;-)

For the second phase, I see two options
  .) Go forward with the transient XIDs / RIDs approach
  .) Do something similar to the indcreatexid idea the HOT patch implements.
     This essentially puts the burden of deciding an index's usability
     on the using xact, not on the one creating the index.

greetings, Florian Pflug




Re: [HACKERS] [WIP PATCH] Lazily assign xids for toplevel Transactions

2007-08-27 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

What do you think about solving the requirements of the *first* waiting
phase (Where we wait for current ShareLock holders) inside the lock manager?
The only real downside I can see is that I feel uneasy about messing with
that code... It seems to be subtle, and quick to anger ;-)


It sounded pretty dubious to me.  The problem is that we don't want to
wait until we could actually *get* the lock, we only want to wait out
the conflicting xacts that existed when we looked.  This is important
because there might be a steady stream of new xacts acquiring
conflicting locks (ie, a steady stream of writers), and we don't want to
either block them, or have to hope for a window where there are none.
But the lock manager does not currently track who acquired a lock when,
and I think it would add a lot of usually-wasted overhead to do that.

I spent a fair amount of time yesterday trying to think of alternative
solutions, and didn't really think of much.  The core reason why C.I.C.
is implemented the way it is is that it's entirely possible that there
will be a deadlock with some other process (ie, the other process is
old enough that we must wait for it, but it's blocked trying to acquire
some lock that conflicts with our ShareUpdateExclusiveLock).  Thus,
waiting by trying to acquire XID locks is a good idea because it
automatically detects deadlock, and may even be able to escape the
deadlock by wait-queue rearrangement.  (I'm not certain that the latter
is applicable in any situation C.I.C. would get into, but I'm not
certain it's not, either.)  Schemes involving "sleep awhile and check
the ProcArray again" are right out because they fail to detect
deadlock.  Everything else I could think of involved special new
lockmanager features that would have to still preserve the ability
to handle deadlocks, which didn't sound like something I want to
tackle for this.

So on the whole the extra transaction identifier seems to be the
way to go.  I haven't looked at how that interacts with HOT though.


Ok, I'll update the patch to use your global id plus local id
idea then, and remove TemporaryTransactionIds.

greetings, Florian Pflug




[HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment

2007-08-29 Thread Florian G. Pflug

Hi

When we lazily assign XIDs, we gain another flag beside the existing
MyXactMadeTempRelUpdates, MyXactMadeXLogEntry, MyLastRecPtr and smgr's
pendingDeletes to tell what kind of actions a transaction performed. Adding
TransactionIdIsValid(GetCurrentTransactionIdIfAny()) on top of that
makes things quite impenetrable - at least for me. So I'm trying to
wrap my head around that logic, and simplify it a bit if possible.
(Nowadays, async commit even adds a bit more complexity)

Currently, we write out a COMMIT record if a transaction either created
any transaction-controlled XLOG entries (MyLastRecPtr.xrecoff != 0), or
scheduled the deletion of files on commit. Afterwards, the xlog is flushed
to the end of the last record created by the session (ProcLastRecEnd) if the
transaction created any xlog record at all. If we either made
transaction-controlled XLOG entries, or temporary relation updates, we
update the XID status in the CLOG.

An ABORT record is currently created if a transaction either created
transaction-controlled XLOG entries or scheduled the deletion of files
on abort. If we scheduled file deletions, we flush the XLOG up to the
ABORT record. If we either made transaction-controlled XLOG entries,
updated temporary relations, or scheduled deletions we update the XID
status in the CLOG.

For subtransaction commit, a COMMIT record is emitted if we either made
transaction-controlled XLOG entries, or updated temporary relations.
No XLOG flush is performed.

Subtransaction ABORTS are handled the same way as regular transaction
aborts.

For toplevel transaction commits, we defer flushing the xlog if
synchronous_commit = off, and we didn't schedule any file deletions.

Now, with lazy XID assignment I believe the following holds true
.) If we didn't assign a valid XID, we cannot have made transaction-controlled
   XLOG entries (Can be checked by asserting that the current transaction id
   is valid if XLOG_NO_TRAN isn't set in XLogInsert).
.) We cannot have scheduled files for deletion (on either COMMIT or ABORT)
   if we don't have a valid XID, since any such deletion will happen together
   with a catalog update. Note that this is already assumed to be true for
   subtransactions, since they only call RecordSubTransaction{Commit|Abort}
   if they have an XID assigned.

I propose to do the following in my lazy XID assignment patch - can
anyone see a hole in that?

.) Get rid of MyLastRecPtr and MyXactMadeTempRelUpdates. Those are
   superseded by TransactionIdIsValid(GetCurrentTransactionIdIfAny()).
.) Get rid of MyXactMadeXLogEntry. Instead, just reset ProcLastRecEnd.
.) Rename ProcLastRecEnd to XactLastRecEnd, and reset when starting
   a new toplevel transaction.

Transaction COMMIT:
  Write an COMMIT record if and only if we have a valid XID.
  Then, flush the XLOG to XactLastRecEnd if that is set, and
  synchronous_commit=on.
  Afterwards, update the CLOG if and only if we have a valid XID.

Transaction ABORT:
  Write an ABORT record if and only if we have a valid XID.
  Then, flush the XLOG to XactLastRecEnd if that is set, and
  we scheduled on-abort deletions.
  Update the CLOG if and only if we have a valid XID.

Subtransaction COMMIT:
  Update the CLOG if and only if we have a valid XID.

Subtransaction ABORT:
  Write an ABORT record if and only if we have a valid XID.
  Then, flush the XLOG to XactLastRecEnd if that is set, and
  we scheduled on-abort deletions.
  Update the CLOG if and only if we have a valid XID.

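Condensed into pseudo-C, the proposed toplevel COMMIT path would look roughly
like this (all routine names are stand-ins, not the real functions, and error
handling is omitted):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;             /* simplified: 0 means "nothing written" */

/* toy per-backend state mirroring the variables discussed above */
static TransactionId currentXid = 0;     /* 0 = no xid assigned, i.e. read-only  */
static XLogRecPtr    XactLastRecEnd = 0; /* end of the last record we wrote      */
static bool          synchronous_commit = true;

/* stand-ins for the real routines */
static XLogRecPtr WriteCommitRecord(TransactionId xid) { return (XLogRecPtr) xid; }
static void       FlushXLogUpTo(XLogRecPtr upto)       { (void) upto; }
static void       MarkClogCommitted(TransactionId xid) { (void) xid; }

static void RecordTransactionCommit(void)
{
    bool haveXid = (currentXid != 0);

    /* a COMMIT record if and only if we were assigned a real xid */
    if (haveXid)
        XactLastRecEnd = WriteCommitRecord(currentXid);

    /* flush up to the last record written in this xact, if any,
     * and only when synchronous commit is wanted */
    if (XactLastRecEnd != 0 && synchronous_commit)
        FlushXLogUpTo(XactLastRecEnd);

    /* clog update if and only if we have a real xid */
    if (haveXid)
        MarkClogCommitted(currentXid);
}

int main(void)
{
    RecordTransactionCommit();   /* read-only xact: writes and flushes nothing  */

    currentXid = 100;            /* pretend the next xact wrote something...    */
    XactLastRecEnd = 0;
    RecordTransactionCommit();   /* ...so it logs, flushes and updates the clog */
    return 0;
}
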
I think we might go even further, and *never* flush the XLOG on abort,
since if we crash just before, the abort won't log anything either. But
if we leak the leftover files in such a case, that's probably a bad idea.

greetings, Florian Pflug




[HACKERS] Representation of ResourceOwnerIds (transient XIDs) in system views (lazy xid assignment)

2007-08-29 Thread Florian G. Pflug

Hi

Since generating transient XIDs (named ResourceOwnerIDs in my patch, since
their lifetime is coupled to the lifetime of a transaction's toplevel
resource owner) seems to be the way to go for lazy xid assignment, I need
to find a way to represent them in the pg_locks view.

ResourceOwnerIds are a structure composed of two uint32s, a processID
(could be the PID of the backend, but to make sure that it isn't reused
too quickly, it's actually a synthetic ID generated at backend start),
and a localTransactionId, which is just incremented whenever a new transaction
is started in a backend. This design was the result of my discussion with
Tom - its main advantage is that it needs no lock to generate a new
ResourceOwnerId.

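For reference, the shape of the identifier and its generation, boiled down to a
few lines (field names as in the patch, the shared session-id assignment is
glossed over):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t LocalTransactionId;

typedef struct ResourceOwnerId
{
    uint32_t           processId;           /* synthetic per-backend session id */
    LocalTransactionId localTransactionId;  /* bumped once per toplevel xact    */
} ResourceOwnerId;

/* Assigned once, at backend start, from a shared counter (a constant here). */
static uint32_t MyProcessId = 42;

/* Purely backend-local, so no lock is needed to hand out a new RID. */
static LocalTransactionId nextLocalTransactionId = 1;

static ResourceOwnerId NewResourceOwnerId(void)
{
    ResourceOwnerId rid;

    rid.processId = MyProcessId;
    rid.localTransactionId = nextLocalTransactionId++;
    return rid;
}

int main(void)
{
    ResourceOwnerId rid = NewResourceOwnerId();

    printf("%u/%u\n", rid.processId, rid.localTransactionId);
    return 0;
}
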
I see 4 possibilities to represent this in system views
A) Make ResourceOwnerID a full-blown type, with in and out methods, very
   similar to tids. "processId/localTransactionId" would be a natural
   string representation.
B) Just convert the ResourceOwnerId into a string in pg_lock_status.
   Looks quite similar to (A) from a user's point of view, but the
   implementation is much shorter.
C) Combine the two uint32 fields of ResourceOwnerId into an int8.
   Might be more efficient than (B). The main disadvantage is that
   some ResourceOwnerIds will be represented by *negative* integers,
   which is pretty ugly.
D) Just make them two int4 fields. This has the same "negativity"
   issue that (C) has, and might cause confusion if users don't
   read the docs carefully.

I'm leaning towards (A), but it adds a lot of new code (although most of
it would be copied nearly 1-to-1 from tid.c) for maybe too little gain.

If (A) is deemed not appropriate, doing (C) and restricting processIds
to <= 0x8000 might be an option.

greetings, Florian Pflug




Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment

2007-08-29 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

I propose to do the following in my lazy XID assignment patch - can
anyone see a hole in that?


One comment is that at the time we make an entry into smgr's
pending-deletes list, I think we might not have acquired an XID yet
--- if I understand your patch correctly, a CREATE TABLE would acquire
an XID when it makes its first catalog insertion, and that happens
after creating the on-disk table file.  So it seems like a good idea
for smgr itself to trigger acquisition of an XID before it makes a
pending-deletes entry.  This ensures that you can't have a situation
where you have deletes to record and no XID; otherwise, an elog
between smgr insertion and catalog insertion would lead to just that.


I wonder a bit about the whole special-casing
of COMMITs/ABORTs with pending delete, though. A crash might always leave
stray files around, so there ought to be a way to clean them up anyway.
Still, for now I'll go with your suggestion, and force XID assignment
in the smgr.


.) Rename ProcLastRecEnd to XactLastRecEnd, and reset when starting
a new toplevel transaction.


I'm not very happy with that name for the variable, because it looks
like it might refer to the last transaction-controlled record we
emitted, rather than the last record of any type.  Don't have a really
good suggestion though --- CurXactLastRecEnd is the best I can do.


Hm.. don't have a good suggestion, either - the reason I want to rename
it is that ProcLastRecEnd doesn't sound like it'd be reset at transaction
start.


One thought here is that it's not clear that we really need a concept of
transaction-controlled vs not-transaction-controlled xlog records
anymore.  In CVS HEAD, the *only* difference no_tran makes is whether
to set MyLastRecPtr, and you propose removing that variable.  This
seems sane to me --- the reason for having the distinction at all was
Vadim's plan to implement transaction UNDO by scanning its xlog records
backwards, and that idea is as dead as a doornail.  So we could simplify
matters conceptually if we got rid of any reference to such a
distinction.


I've been thinking about keeping XLOG_NO_TRAN, and doing
if (!no_tran)
  Assert(TransactionIdIsValid(GetCurrentTransactionIdIfAny()));
in xlog.c as a safety measure. We can't make that assertion
unconditionally, I think, because nextval() won't force XID
assignment, but might do XLogInsert.

greetings, Florian Pflug




Re: [HACKERS] Representation of ResourceOwnerIds (transient XIDs) in system views (lazy xid assignment)

2007-08-29 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Since generating transient XIDs (named ResourceOwnerIDs in my patch, since
their lifetime is coupled to the lifetime of a transaction's toplevel
resource owner) seems to be to way to go for lazx xid assignment, I need
to find a way to represent them in the pg_locks view.


This is going very far towards gilding the lily.  Try to avoid loading
the patch down with a new datatype.

I'm inclined to think that it'd be sufficient to show the high half of
the ID (that is, the session number) in pg_locks, because there will
never be cases where there are concurrently existing locks on different
localTransactionIds.


Hm.. I'm not too happy with that. If you, for example, join pg_locks to
pg_stat_activity (which would need to show the RID too), then you
*might* get a bogus result if a transaction ends and a new one starts
on the same backend between the time pg_lock_status is called, and the time
the proc array is read.


This could probably be displayed in the
transactionID columns, which would mean we're abusing the user-visible
xid datatype, but I don't see much harm in it.


I'm even more unhappy with that, because the session id of a RID might
coincide with a currently in-use XID.

What about the following.
.) Remove the right-hand side XID from pg_locks (The one holder or waiter
   of the lock). It seems to make more sense to store a RID here, and let
   the user fetch the XID via a join to pg_stat_activity. We could also show
   both the XID (if set) and the RID, but that might lead people to believe
   that their old views or scripts on top of pg_locks still work correctly
   when they actually do not.
.) On the left-hand side (The locked object), add a RID column of type int8,
   containing (2^32)*sessionID + localTransactionId.
.) To prevent the int8 from being negative, we limit the sessionID to 31 bits -
   which is still more than enough.

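A quick numeric illustration of that encoding (plain C, assuming a working
64-bit integer type; nothing here is actual backend code):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

/* Pack a RID into the proposed int8 pg_locks column: high 32 bits are
 * the session id, low 32 bits the local transaction id. */
static int64_t RidToInt8(uint32_t sessionId, uint32_t localTransactionId)
{
    uint64_t packed = ((uint64_t) sessionId << 32) | (uint64_t) localTransactionId;

    /* the conversion wraps to a negative value once bit 63 is set,
     * which is exactly the negativity problem mentioned above */
    return (int64_t) packed;
}

int main(void)
{
    /* session ids below 2^31 keep the column positive */
    printf("%" PRId64 "\n", RidToInt8(3, 17));            /* 12884901905 */

    /* a full 32-bit session id flips the sign - hence the 31-bit cap */
    printf("%" PRId64 "\n", RidToInt8(0x80000000u, 17));  /* negative   */
    return 0;
}
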
greetings, Florian Pflug




Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment

2007-08-29 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

One thought here is that it's not clear that we really need a concept of
transaction-controlled vs not-transaction-controlled xlog records
anymore.



I've been thinking about keeping XLOG_NO_TRAN, and doing
if (!no_tran)
   Assert(TransactionIdIsValid(GetCurrentTransactionIdIfAny()));
in xlog.c as a safety measure.


Why do you think this is a safety measure?  All that it is checking
is whether the caller has preserved an entirely useless distinction.
The real correctness property is that you can't write your XID
into a heap tuple or XLOG record if you haven't acquired an XID,
but that seems nearly tautological.


I was confused. I wanted to protect against the case the an XID hits
the disk, but doesn't show up in any xl_xid field, and therefore might
be reused after crash recovery. But of course, to make that happen
you'd have to actually *store* the XID into the data you pass to
XLogInsert, which is kind of hard if you haven't asked for it first.

So, I now agree, XLOG_NO_TRAN should be buried.

greetings, Florian Pflug



[HACKERS] int8 & INT64_IS_BUSTED

2007-08-29 Thread Florian G. Pflug

Hi

I'm confused about whether int8s work on a machine on which
INT64_IS_BUSTED. My reading of the code suggests that int8
will be available, but be, well, busted on such a machine.

For example, int8mul seems as if it'd just return the wrong
answer on such a machine.

Or are platforms with INT64_IS_BUSTED no longer supported,
and are all those #ifdefs only legacy code?

Please enlighten a poor linux+gcc user who can't remember
ever using a compiler without a "long long" datatype after
leaving TurboC under DOS.

greetings, Florian Pflug



Re: [HACKERS] Representation of ResourceOwnerIds (transient XIDs) in system views (lazy xid assignment)

2007-08-29 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

What about the following.
.) Remove the right-hand side XID from pg_locks (The one holder or waiter
of the lock). It seems to make more sense to store a RID here,


Yeah, we have to do that since there might not *be* an XID holding the
lock.  But I still think the session ID would be sufficient here.
(Perhaps we don't need the PID either, although then we'd need to change
pg_stat_activity to provide session id as a join key...)


Yeah, the PID seems to be redundant if we add the RID. But OTOH it does no
harm to leave it there - unlike the xid, which gives a false sense
of security. I don't know what our policy for system-catalog
backwards-compatibility is, though...


.) On the left-hand side (The locked object), add a RID column of type int8,
containing (2^32)*sessionID + localTransactionId.


I'm a bit uncomfortable with that since it renders the view completely
useless if you don't have a working int8 type.


Yeah, I only now realized that int8 really *is* busted if INT64_IS_BUSTED is
defined. I always thought that there was some kind of emulation code in place,
but apparently there isn't. :-( So there goes this idea.


.) To prevent the int8 from being negative, we limit the sessionID to 31 bits -
which is still more than enough.


Hmm ... actually, that just begs the question of how many bits we need
at all.  Could we display, say, 24 bits of sessionID and 8 bits of
localXID merged into a column of nominal XID type?  There's a
theoretical risk of false join matches but it seems pretty theoretical,
and a chance match would not break any system functionality anyway since
all internal operations would be working with full-width counters.


Hm.. If we go down that route, we could just calculate some hash value
from sessionID and localTransactionId that fits into 31 bits, and use
an int4. Or 32 bits, and use xid.

I am, however, a bit reluctant to do this. I'd really hate to spend a few hours
tracking down some locking problem, only to find out that I'd been looking at
the wrong place because of some id aliasing... I know it's only a 1-in-4-billion
chance, but still it gives me an uneasy feeling.

What about a string representation? Something like sessionId/localTransactionId?
Should we ever decide that indeed this *should* get its own datatype, a string
representation would allow for a very painless transition...

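Just to visualize how cheap the string form would be (a sketch, not a proposal
for the actual in/out functions):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Render a RID as "sessionId/localTransactionId", e.g. "42/7". */
static void rid_out(uint32_t sessionId, uint32_t localTransactionId,
                    char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRIu32 "/%" PRIu32, sessionId, localTransactionId);
}

/* Parse the string form again; returns 1 on success, 0 on malformed input. */
static int rid_in(const char *str, uint32_t *sessionId,
                  uint32_t *localTransactionId)
{
    return sscanf(str, "%" SCNu32 "/%" SCNu32, sessionId, localTransactionId) == 2;
}

int main(void)
{
    char     buf[32];
    uint32_t s, l;

    rid_out(42, 7, buf, sizeof(buf));
    if (rid_in(buf, &s, &l))
        printf("%s -> session %" PRIu32 ", local xact %" PRIu32 "\n", buf, s, l);
    return 0;
}
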
greetings, Florian Pflug




Re: [HACKERS] int8 & INT64_IS_BUSTED

2007-08-29 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

I'm confused about whether int8s work on a machine on which
INT64_IS_BUSTED. My reading of the code suggests that int8
will be available, but be, well, busted in such a machine.


The datatype exists, but it's really only int32.


For example, int8mul seems as if I'd just return the wrong
answer on such a machine.


Well, obviously it's gonna overflow sooner than you'd think, but it will
give valid answers as long as you never try to compute a value that
doesn't fit in int32; and it will correctly complain if you do.


I still think int8mul is buggy. It calculates result as arg1 * arg2, and then
checks for an overflow by dividing again, and seeing if the right answer
comes out. Which sounds good. But it *skips* that check if both arguments
fit into an int32 - check is
(arg1 == (int64) ((int32) arg1) && arg2 == (int64) ((int32) arg2)).

Which for INT64_IS_BUSTED seems to be equivalent to
(arg1 == arg1 && arg2 == arg2), and thus the check will never fire
in that case.

I didn't test this though - so maybe I'm just reading it wrong.

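To see what I mean, the shortcut can be simulated in a standalone program by
typedef'ing int64 down to 32 bits (this mirrors the logic quoted above, it is
not the actual int8.c code; the multiplication is done via unsigned arithmetic
so the wraparound is well-defined):

#include <stdio.h>
#include <stdint.h>

/* Simulate INT64_IS_BUSTED: "int64" is really only 32 bits wide. */
typedef int32_t int64;
typedef int32_t int32;

static int64 busted_int8mul(int64 arg1, int64 arg2, int *overflow)
{
    /* the real code multiplies directly; done via unsigned here so the
     * wraparound is well-defined for the demonstration */
    int64 result = (int64) ((uint32_t) arg1 * (uint32_t) arg2);

    *overflow = 0;

    /* the division check is skipped when both arguments round-trip
     * through int32 - which, with a 32-bit int64, is always the case */
    if (arg1 != (int64) ((int32) arg1) || arg2 != (int64) ((int32) arg2))
    {
        if (arg2 != 0 && result / arg2 != arg1)
            *overflow = 1;
    }
    return result;
}

int main(void)
{
    int   overflow;
    int64 result = busted_int8mul(100000, 100000, &overflow);

    /* 10^10 cannot fit in 32 bits, yet no overflow is reported */
    printf("result = %d, overflow detected = %d\n", (int) result, overflow);
    return 0;
}
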
greetings, Florian Pflug



Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment

2007-08-29 Thread Florian G. Pflug

Tom Lane wrote:

One comment is that at the time we make an entry into smgr's
pending-deletes list, I think we might not have acquired an XID yet
--- if I understand your patch correctly, a CREATE TABLE would acquire
an XID when it makes its first catalog insertion, and that happens
after creating the on-disk table file.  So it seems like a good idea
for smgr itself to trigger acquisition of an XID before it makes a
pending-deletes entry.  This ensures that you can't have a situation
where you have deletes to record and no XID; otherwise, an elog
between smgr insertion and catalog insertion would lead to just that.


Hm.. I was just going to implement this, but I'm now wondering if
that's really worth it.

For smgrcreate, this would catch the following case:
.) CREATE something
.) smgrcreate: Creates file, and puts it onto the delete-on-abort
   list
.) We elog() *before* acquiring an XID
.) RecordTransactionAbort or RecordSubTransactionAbort:
   We don't write an ABORT record.
.) We crash *before* actually deleting the file

Compare the probability of that happening (The elog *and* the crash)
with the probability of
.) CREATE something
.) smgrcreate: Creates the file
.) We crash *before* we have to chance to commit or abort.

The window in which a crash causes us to leak the file seems to be much
wider in the second case, yet forcing XID assignment will not help to
prevent it, unless I'm overlooking something.

In the smgrunlink case, there is no reason at all to force XID assignment,
because if we abort or crash, we don't want to unlink anyway, and if we
survive until we commit, we'll assign the XID during the inevitable catalog
update.

The only thing the forced XID assignment would buy is to be able to stick
if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
  Assert(nrels == 0);
into the various Record{Sub|}Transaction{Commit|Abort} functions.

So unless I'm overlooking something, I believe for now it's best to ignore this
issue, and to do a proper fix in the long run that removes *all* possible
leakages.

greetings, Florian Pflug




Re: [HACKERS] Advice on MyXactMade* flags, MyLastRecPtr, pendingDeletes and lazy XID assignment

2007-08-29 Thread Florian G. Pflug

Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes:

One comment is that at the time we make an entry into smgr's
pending-deletes list, I think we might not have acquired an XID yet



Hm.. I was just going to implement this, but I'm now wondering if
that's really worth it.


Basically what you'd give up is the ability to Assert() that there are
no deletable files if there's no XID, which seems to me to be an
important cross-check ... although maybe making smgr do that turns
this "cross-check" into a tautology ... hmm.  I guess the case that's
bothering me is where we reach commit with deletable files and no XID.
But that should probably be an error condition anyway, ie, we should
error out and turn it into an abort.  On the abort side we'd consider
it OK to have files and no XID.  Seems reasonable to me.


I've done that now, and it turned out nicely. There is an Assertion
on "(nrels == 0) || xid assigned" in the COMMIT path, but
not in the ABORT path. Seems reasonable and safe.
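
For concreteness, the check I have in mind in the COMMIT path is roughly this
(a sketch only - the code in the patch may end up looking somewhat different):

Assert(nrels == 0 || TransactionIdIsValid(GetCurrentTransactionIdIfAny()));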

And I'm quite tempted to not flush the XLOG at all during ABORT, and to
only force synchronous commits if one of the to-be-deleted files is
non-temporary. The last idea widens the leakage window quite a bit
though, so maybe I should rather resist that temptation...

OTOH, it'd allow asynchronous commits for transactions that created
temporary tables.


The only way we could make this more robust is if we could have a
WAL-before-data rule for file *creation*, but I think that's not
possible given that we don't know what relfilenode number we will use
until we've successfully created a file.  So there will always be
windows where a crash leaks unreferenced files.  There's been some
debate about having crash recovery search for and delete such files, but
so far I've resisted it on the grounds that it sounds like data loss
waiting to happen --- someday it'll delete a file you wished it'd kept.


It seems doable, but it's not pretty. One possible scheme would be to
emit a record *after* choosing a name but *before* creating the file,
and then a second record when the file is actually created successfully.

Then, during replay we could remember a list of xids and filenames,
and remove those files for which we haven't seen either a "created
successfully" record or a COMMIT record for the creating xid.

With this scheme, it'd be natural to force XID assignment in smgrcreate,
because we'd actually depend on logging the xid there.
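
To make the replay side a bit more concrete, here's a rough standalone sketch;
PendingCreate, cleanup_orphaned_files() and xact_committed() are made-up names
for illustration, not code from any patch:

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* One entry per "intend to create" record seen during replay. */
typedef struct PendingCreate
{
    uint32_t    xid;            /* creating transaction, logged before creation */
    char        path[1024];     /* file name chosen before creation */
    bool        confirmed;      /* did we also see a "created successfully" record? */
} PendingCreate;

/* Stub - a real implementation would consult the COMMIT records seen
 * during replay (or clog) to answer this. */
static bool
xact_committed(uint32_t xid)
{
    (void) xid;
    return false;
}

/* At the end of recovery, remove every file whose creation was never
 * confirmed, or whose creating transaction never committed. */
static void
cleanup_orphaned_files(PendingCreate *pending, int npending)
{
    int         i;

    for (i = 0; i < npending; i++)
    {
        if (!pending[i].confirmed || !xact_committed(pending[i].xid))
            unlink(pending[i].path);
    }
}

int
main(void)
{
    /* Example: one pending creation that was never confirmed. */
    PendingCreate pending[1] = {
        { 1234, "base/16384/16385", false }
    };

    cleanup_orphaned_files(pending, 1);
    return 0;
}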

greetings, Florian Pflug



