AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-30 Thread Zeugswetter Andreas SB


So are whole pages stored in rollback segments or just
the modified data?
   
   This is implementation dependent. Storing whole pages is
   much easier to do, but obviously it's better to store just
   the modified data.
  
  I am not sure it is necessarily better. Seems to be a tradeoff here.
  Pros of whole pages:
  a possible merge with physical log (for the first modification of a page
  after checkpoint there would be no overhead compared to current, since
  it is already written now)
 
 Using WAL as RS data storage is questionable.

No, I meant the other way around: move the physical log pages out of the WAL
files and into the rollback segment (imho "snapshot area" would be a better name).

  in a clever implementation a page already in the rollback segment might
  satisfy the modification of another row on that page, and thus would not
  need any additional io.
 
 This would be possible only if there was no commit (same SCN)
 between two modifications.

I don't think someone else's commit matters unless it touches the same page.
In that case a reader would possibly need to chain back to an older version
inside the snapshot area, and then it gets complicated even in the whole-page
case. A good concept could probably involve both whole-page and change-only
records, and let the optimizer decide what to do.

 But, aren't we too deep into overwriting-smgr (O-smgr) implementation?

Yes, but some understanding of the possibilities needs to be sorted out 
to allow good decisions, no?

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Zeugswetter Andreas SB


 You mean it is restored in the session that is running the transaction ?
   
   Depends on what you mean by restored. It first reads the heap page,
   sees that it needs an older version and thus reads it from the rollback
   segment.
  
  So are whole pages stored in rollback segments or just the modified data?
 
 This is implementation dependent. Storing whole pages is much easier to do,
 but obviously it's better to store just the modified data.

I am not sure it is necessarily better. Seems to be a tradeoff here.
Pros of whole pages:
a possible merge with physical log (for the first modification of a page after
checkpoint there would be no overhead compared to current, since it is already
written now)
in a clever implementation a page already in the rollback segment might satisfy
the modification of another row on that page, and thus would not need any
additional io.

Andreas




RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Mikheev, Vadim

   So are whole pages stored in rollback segments or just
   the modified data?
  
  This is implementation dependent. Storing whole pages is
  much easier to do, but obviously it's better to store just
  the modified data.
 
 I am not sure it is necessarily better. Seems to be a tradeoff here.
 pros of whole pages:
   a possible merge with physical log (for first
   modification of a page after checkpoint
   there would be no overhead compared to current 
   since it is already written now)

Using WAL as RS data storage is questionable.

   in a clever implementation a page already in the
   rollback segment might satisfy the 
   modification of another row on that page, and 
   thus would not need any additional io.

This would be possible only if there was no commit (same SCN)
between two modifications.

But, aren't we too deep into overwriting-smgr (O-smgr) implementation?
It's doable. It has advantages in terms of the IO active transactions
must do to follow MVCC. It has drawbacks in terms of required
disk space (and, oh yeah, it's not easy to implement -:)).
So, any other opinions about the value of O-smgr?

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Mikheev, Vadim

   Seems overwrite smgr has mainly advantages in terms of
   speed for operations other than rollback.
  
  ... And rollback is required for < 5% of transactions ...
 
 This obviously depends on application. 

A small number of aborted transactions was used to show the
uselessness of UNDO in terms of space cleanup - that's why
I use the same argument to show the usefulness of O-smgr -:)

 I know people who roll back most of their transactions
 (actually they use it to emulate temp tables when reporting). 

Shouldn't they use TEMP tables? -:)

 OTOH it is possible to do without rolling back at all as
 MySQL folks have shown us ;)

Not with SDB tables which support transactions.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Mikheev, Vadim

   OTOH it is possible to do without rolling back at all as
   MySQL folks have shown us ;)
  
  Not with SDB tables which support transactions.
 
 My point was that MySQL was used quite a long time without it 
 and still quite many useful applications were produced.

And my point was that it is needless to talk about rollbacks in a
non-transactional system, while in a transactional system one has to
implement rollback somehow.

 BTW, do you know what strategy is used by BSDDB/SDB for 
 rollback/undo ?

AFAIR, they use O-smgr => UNDO is required.

Vadim




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-28 Thread Zeugswetter Andreas SB


  You mean it is restored in the session that is running the transaction ?

Depends on what you mean by restored. It first reads the heap page,
sees that it needs an older version and thus reads it from the rollback segment.

  
  I guess that it could be slower than our current way of doing it.
 
 Yes, for older transactions which *really* need that *particular*
 old data, but not for newer ones. Look - now transactions have to read
 dead data again and again, even if some of them (newer) need not see
 those data at all, and we keep dead data as long as required for other
 old transactions *just in case* they will look there.
 But who knows?! Maybe those old transactions will not read from the table
 with the big amount of dead data at all! So - why keep dead data in datafiles
 for a long time? This obviously affects overall system performance.

Yes, that is a good description. And an old version is only required in the
following two cases (rough sketch below):

1. the txn that modified this tuple is still open (reader in default read committed)
2. the reader is in serializable transaction isolation and has an earlier xid
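
In code, the two cases amount to roughly this (a minimal sketch with invented
names, not actual backend code; xid wraparound is ignored):

#include <stdbool.h>

typedef unsigned int TransactionId;

/* Does a reader still need the pre-update version of a tuple? */
bool
need_old_version(TransactionId writer_xid, bool writer_still_open,
                 TransactionId reader_xid, bool reader_serializable)
{
    /* Case 1: the writer is still open, so a READ COMMITTED reader
     * must not see its changes yet. */
    if (writer_still_open)
        return true;

    /* Case 2: a SERIALIZABLE reader whose xid (snapshot) predates the
     * writer must keep seeing the version as of its own start. */
    if (reader_serializable && reader_xid < writer_xid)
        return true;

    return false;               /* the current version is visible */
}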

Seems overwrite smgr has mainly advantages in terms of speed for operations
other than rollback.

Andreas




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-28 Thread Vadim Mikheev

 Yes, that is a good description. And an old version is only required in the
 following two cases:
 
 1. the txn that modified this tuple is still open (reader in default read committed)
 2. the reader is in serializable transaction isolation and has an earlier xid
 
 Seems overwrite smgr has mainly advantages in terms of speed for operations
 other than rollback.

... And rollback is required for < 5% of transactions ...

Vadim






Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-28 Thread Vadim Mikheev

You mean it is restored in the session that is running the transaction ?
  
  Depends on what you mean by restored. It first reads the heap page,
  sees that it needs an older version and thus reads it from the rollback segment.
 
 So are whole pages stored in rollback segments or just the modified data?

This is implementation dependent. Storing whole pages is much easier to do,
but obviously it's better to store just the modified data.

Vadim






AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Zeugswetter Andreas SB


   Impractical ? Oracle does it.
  
  Oracle has MVCC?
  
  With restrictions, yes.
 
 What restrictions? Rollback segments size?

No, that is not the whole story. The problem with their rollback segment approach
is that they do not guard against overwriting a tuple version in the rollback
segment: they simply recycle each segment in a wrap-around manner.
Thus there could be an open transaction that still wanted to see a tuple version
that was already overwritten, leading to the feared "snapshot too old" error.
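
A minimal sketch of why that happens (illustrative C, not Oracle's actual
implementation; the names and the fixed-size array are assumptions):

#include <stddef.h>

#define UNDO_SLOTS 1024                    /* fixed-size, recycled */

typedef struct { unsigned long scn; /* ... before-image data ... */ } UndoRec;

static UndoRec undo[UNDO_SLOTS];
static unsigned long next_slot = 0;        /* grows without bound */

/* Writers append before-images, silently overwriting the oldest entry. */
void save_before_image(UndoRec rec)
{
    undo[next_slot++ % UNDO_SLOTS] = rec;
}

/* A reader chasing an old version may find its slot already recycled:
 * that is the "snapshot too old" failure. */
UndoRec *fetch_version(unsigned long slot)
{
    if (next_slot - slot > UNDO_SLOTS)
        return NULL;                       /* snapshot too old */
    return &undo[slot % UNDO_SLOTS];
}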

Copying their rollback segment approach is imho the last thing we want to do.

 Non-overwriting smgr can eat all disk space...
 
  You didn't know that?  Vadim did ...
 
 Didn't I mention a few times that I was inspired by Oracle? -:)

Looking at what they supply in the feature area is imho good.
Copying their technical architecture is not so good in general.

Andreas




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Mikheev, Vadim

 Do we want to head for an overwriting storage manager?
 
 Not sure.  
 
 Advantages:  UPDATE has easy space reuse because usually done
 in-place, no index change on UPDATE unless key is changed.
 
 Disadvantages:  Old records have to be stored somewhere for MVCC use. 
 Could limit transaction size.

Really? Why is it assumed that we *must* limit the size of rollback segments?
We can let them grow without bounds, as we do now keeping old records in
datafiles and letting them eat all of the disk space.

 UNDO disadvantages are:
 
   Limit size of transactions to log storage size.

Don't be kidding - in any system transaction size is limited
by available storage. So we should just say that more disk space
is required for UNDO. From my POV, putting down $100 to buy a 30Gb
disk is no big deal, keeping in mind that PGSQL requires
$ZERO to be used.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Mikheev, Vadim

   Oracle has MVCC?
   
   With restrictions, yes.
  
  What restrictions? Rollback segments size?
 
 No, that is not the whole story. The problem with their
 rollback segment approach is that they do not guard against
 overwriting a tuple version in the rollback segment:
 they simply recycle each segment in a wrap-around manner.
 Thus there could be an open transaction that still wanted to
 see a tuple version that was already overwritten, leading to the
 feared "snapshot too old" error.
 
 Copying their rollback segment approach is imho the last 
 thing we want to do.

So, they limit size of rollback segments and we don't limit
how big our datafiles may grow if there is some long running
transaction in serializable mode. We could allow our rollback
segments to grow without limits as well.

  Non-overwriting smgr can eat all disk space...
  
   You didn't know that?  Vadim did ...
  
  Didn't I mention a few times that I was inspired by Oracle? -:)
 
 Looking at what they supply in the feature area is imho good.
 Copying their technical architecture is not so good in general.

Copying is not inspiration -:)

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Mikheev, Vadim

  Removing dead records from rollback segments should
  be faster than from datafiles.
 
 Is it for better locality or are they stored in a different way ?

Locality - all dead data would be localized in one place.

 Do you think that there is some fundamental performance advantage
 in making a copy to rollback segment and then deleting it from
 there vs. reusing space in datafiles ?

As shown by WAL, additional writes don't mean worse performance.
As for deleting from the RS (rollback segment) - we could remove or reuse
RS files as a whole.

   How does it do MVCC with an overwriting storage manager ?
  
  1. System Change Number (SCN) is used: system increments it
 on each transaction commit.
  2. When scan meets data block with SCN > SCN as it was when
 query/transaction started, old block image is restored
 using rollback segments.
 
 You mean it is restored in the session that is running the transaction ?
 
 I guess that it could be slower than our current way of doing it.

Yes, for older transactions which *really* need that *particular*
old data, but not for newer ones. Look - now transactions have to read
dead data again and again, even if some of them (newer) need not see
those data at all, and we keep dead data as long as required for other
old transactions *just in case* they will look there.
But who knows?! Maybe those old transactions will not read from the table
with the big amount of dead data at all! So - why keep dead data in datafiles
for a long time? This obviously affects overall system performance.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

  Impractical ? Oracle does it.
 
 Oracle has MVCC?
 
 With restrictions, yes.

What restrictions? Rollback segments size?
Non-overwriting smgr can eat all disk space...

 You didn't know that?  Vadim did ...

Didn't I mention a few times that I was
inspired by Oracle? -:)

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

 If PostgreSQL wants to stay MVCC, then we should imho forget
 overwriting smgr very fast.
 
 Let me try to list the pros and cons that I can think of:
 Pro:
   no index modification if key stays same
   no search for free space for update (if tuple still
 fits into page)
   no pg_log
 Con:
   additional IO to write before image to rollback segment
   (every before image, not only first after checkpoint)
   (also before image of every index page that is updated !)

I don't think that Oracle writes the entire page as a before-image - just
the tuple data and some control info. As for additional IO - we'll do it
anyway to remove the before-image (deleted tuple data) from data files.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim
 I think so too. I've never said that an overwriting smgr
 is easy and I don't love it particularily.
 
 What I'm objecting to is avoiding UNDO without giving up
 an overwriting smgr. We shouldn't be noncommittal now. 

Why not? We could decide to do an overwriting smgr later
and implement UNDO then. For the moment we could just
change the checkpointer to use checkpoint.redo instead of
checkpoint.undo when defining what log files should be
deleted - it's a few minutes' work, and so is changing it
back.

Vadim



RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

  - A simple typo in psql can currently cause a forced 
  rollback of the entire TX. UNDO should avoid this.
 
 Yes, I forgot to mention this very big advantage, but undo is
 not the only possible way to implement savepoints. Solutions
 using CommandCounter have been discussed.

This would be hell.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

   Oracle has MVCC?
  
   With restrictions, yes.
  
  What restrictions? Rollback segments size?
  Non-overwriting smgr can eat all disk space...
 
 Isn't the same true for an overwriting smgr ? ;)

Removing dead records from rollback segments should
be faster than from datafiles.

   You didn't know that?  Vadim did ...
  
  Didn't I mention a few times that I was
  inspired by Oracle? -:)
 
 How does it do MVCC with an overwriting storage manager ?

1. System Change Number (SCN) is used: system increments it
   on each transaction commit.
2. When scan meets data block with SCN > SCN as it was when
   query/transaction started, old block image is restored
   using rollback segments (rough sketch below).
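
In code, that read path looks roughly like this (a sketch with invented
names; apply_undo_from_rollback_segment is a hypothetical helper):

typedef struct Block {
    unsigned long scn;          /* SCN of the last change to this block */
    /* ... tuple data ... */
} Block;

/* Hypothetical: reconstructs the previous image of a block by applying
 * one undo record from the rollback segment. */
Block *apply_undo_from_rollback_segment(Block *blk);

/* Walk back until the block image is old enough for our snapshot. */
Block *read_consistent(Block *cur, unsigned long snapshot_scn)
{
    while (cur->scn > snapshot_scn)
        cur = apply_undo_from_rollback_segment(cur);
    return cur;
}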

 Could it possibly be a Postgres-inspired bolted-on hack 
 needed for better concurrency ?

-:)) Oracle has had MVCC for years, probably from the beginning
and for sure before Postgres.

 BTW, are you aware how Interbase does its MVCC - is it more 
 like Oracle's way or like PostgreSQL's ?

Like ours.

Vadim




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Philip Warner

At 01:51 25/05/01 +0500, Hannu Krosing wrote:

How does it do MVCC with an overwriting storage manager ?


I don't know about Oracle, but Dec/RDB also does overwriting and MVCC. It
does this by taking a snapshot of pages that are participating in both RW
and RO transactions (Dec/RDB has options on SET TRANSACTION that specify
if the TX will do updates or not). It has the disadvantage that the
snapshot will grow quite large for bulk loads. Typically they are about
10-20% of DB size. Pages are freed from the snapshot as active TXs finish.

Note that the snapshots are separate from the journalling (WAL) and
rollback files.



Philip Warner
Albatross Consulting Pty. Ltd.
http://www.rhyme.com.au




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


  The downside would only be, that long running txn's cannot
  [easily] rollback to savepoint.
 
 We should implement savepoints for all or none transactions, no?

We should not limit transaction size to the online available disk space for WAL. 
Imho that is much more important. With guaranteed undo we would need 
disk space for more than 2x the new data size (+ at least space for 1x all modified 
pages, unless the physical log is separated from WAL).

Imho a good design should involve only a little more than 1x the new data size.
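
To put rough numbers on that (an illustrative calculation, not a measurement):
loading 10 Gb of new data with guaranteed undo would need ~10 Gb in the heap
plus ~10 Gb of retained log to be able to roll back, plus before-images of
every modified page - well over 20 Gb online. A design without guaranteed
undo needs little more than the 10 Gb itself.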

 
   2. Abort long running transactions.
  
 This is imho the big downside of UNDO, and should not
 simply be put on the TODO without thorough research. I think it
 would be better to forget UNDO for long running transactions
 before aborting them.
 
 Abort could be configurable.

The point is that you need to abort before WAL runs out of disk space,
regardless of configuration.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


 If community will not like UNDO then I'll probably try to implement

Imho UNDO would be great under the following circumstances:
1. The undo is only registered for some background work process
and not done in the client's backend (or only if it is a small txn).
2. The same mechanism should also be used for outdated tuples
(the only difference being that some tuples need to wait longer
 because of an active xid).

The reasoning for not doing it in the client's backend is not only that the client
does not need to wait, but that the nervous dba tends to kill the backend if, after
one hour of forward work, it seemingly does not respond anymore (because it is
busy with undo).

 dead space collector which will read log files and so on.

Which would then only be a possible implementation variant of above :-)
First step probably would be to separate the physical log to reduce WAL size.

 to implement logging for non-btree indices (anyway required for UNDO,
 WAL-based BAR, WAL-based space reusing).

Imho it would be great to implement a generic (albeit more expensive) 
redo for all possible index types, that would be used in the absence of a physical 
redo for that particular index type (which is currently available for btree).

The prerequisite would be a physical log that saves the page before 
modification. The redo could then be done (with the info from the heap tuple
log record) with the same index interface that is used during normal operation.

Imho implementing a new index type is difficult enough as is without the need 
to write a redo and undo.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB

  People also have referred to an overwriting smgr
  easily. Please tell me how to introduce an overwriting smgr
  without UNDO.

There is no way. Undo for an overwriting smgr would, however, involve a
very different approach than with a non-overwriting one. See Vadim's post about
what info suffices for undo in a non-overwriting smgr (file and ctid).

 I guess that is the question.  Are we heading for an overwriting storage
 manager?  I didn't see that in Vadim's list of UNDO advantages, but
 maybe that is his final goal.  If so UNDO may make sense, but then the
 question is how do we keep MVCC with an overwriting storage manager?
 
 The only way I can see doing it is to throw the old tuples into the WAL
 and have backends read through that for MVCC info.

If PostgreSQL wants to stay MVCC, then we should imho forget overwriting smgr
very fast.

Let me try to list the pros and cons that I can think of:
Pro:
no index modification if key stays same
no search for free space for update (if tuple still fits into page)
no pg_log
Con:
additional IO to write before image to rollback segment
(every before image, not only first after checkpoint)
(also before image of every index page that is updated !)
need a rollback segment, which imposes all sorts of contention problems
an active rollback that needs to do a lot of undo work

Andreas




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Philip Warner

At 14:33 22/05/01 -0700, Mikheev, Vadim wrote:

If community will not like UNDO then I'll probably try to implement
dead space collector which will read log files and so on. 

I'd vote for UNDO; in terms of usability & friendliness it's a big win.
Tom's plans for FSM etc are, at least, going to get us some useful data,
and at best will mean we can hold off WAL-based FSM for a few versions.





Philip Warner
Albatross Consulting Pty. Ltd.
http://www.rhyme.com.au




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


 If community will not like UNDO then I'll probably try to implement
 dead space collector which will read log files and so on. 
 
 I'd vote for UNDO; in terms of usability & friendliness it's a big win.

Could you please be a little more verbose? I am very interested in 
the advantages you see in UNDO for rollback only.

pg_log is a very big argument, but couldn't we try to change the format
to something that only stores ranges of aborted txn's in a btree-like format
(rough sketch below)? Now that we have WAL, that should be possible.
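
A minimal sketch of the idea (hypothetical structure, not the actual pg_log
format; it assumes crashed transactions get marked aborted at recovery):

#include <stdbool.h>

typedef struct {
    unsigned int first_xid;              /* inclusive */
    unsigned int last_xid;               /* inclusive */
} AbortedRange;

/* Ranges kept sorted (e.g. in a btree); committed xids need no entry,
 * so the structure stays tiny when few transactions abort. */
bool xid_aborted(const AbortedRange *r, int n, unsigned int xid)
{
    int lo = 0, hi = n - 1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;

        if (xid < r[mid].first_xid)
            hi = mid - 1;
        else if (xid > r[mid].last_xid)
            lo = mid + 1;
        else
            return true;                 /* inside an aborted range */
    }
    return false;                        /* treat as committed */
}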

Andreas




Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Philip Warner

At 11:25 23/05/01 +0200, Zeugswetter Andreas SB wrote:

 If community will not like UNDO then I'll probably try to implement
 dead space collector which will read log files and so on. 
 
 I'd vote for UNDO; in terms of usability & friendliness it's a big win.

Could you please be a little more verbose? I am very interested in 
the advantages you see in UNDO for rollback only.

I have not been paying strict attention to this thread, so it may have
wandered into a narrower band than I think we are in, but my understanding
is that UNDO is required for partial rollback in the case of failed
commands within a single TX. Specifically,

- A simple typo in psql can currently cause a forced rollback of the entire
TX. UNDO should avoid this.

- It is not uncommon for applications in other databases to handle errors
from the database (eg. missing FKs), and continue a TX.

- Similarly, when we get a new error reporting system, general constraint
(or other) failures should be able to be handled in one TX.



Philip Warner
Albatross Consulting Pty. Ltd.
http://www.rhyme.com.au




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


 - A simple typo in psql can currently cause a forced rollback of the entire
 TX. UNDO should avoid this.

Yes, I forgot to mention this very big advantage, but undo is not the only
possible way to implement savepoints. Solutions using CommandCounter have been
discussed. Although the pg_log mechanism would become more complex, a background
vacuum-like process could put highest priority on removing such rolled-back parts
of transactions.

Andreas




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Tom Lane

Hiroshi Inoue [EMAIL PROTECTED] writes:
 I guess that is the question.  Are we heading for an overwriting storage
 manager?

 I've never heard that it was given up. So there seems to be
 at least a possibility to introduce it in the future.

Unless we want to abandon MVCC (which I don't), I think an overwriting
smgr is impractical.  We need a more complex space-reuse scheme than
that.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Tom Lane

Hiroshi Inoue [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Unless we want to abandon MVCC (which I don't), I think an overwriting
 smgr is impractical. 

 Impractical ? Oracle does it.

Oracle has MVCC?

regards, tom lane




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB

 Todo:
 
 1. Compact log files after checkpoint (save records of uncommitted
transactions and remove/archive others).

On the grounds that undo is not guaranteed anyway (concurrent heap access),
why not simply forget it, since the above sounds rather expensive?
The only downside would be that long-running txn's cannot [easily] roll back
to a savepoint.

 2. Abort long running transactions.

This is imho the big downside of UNDO, and should not simply be put on 
the TODO without thorough research. I think it would be better to forget UNDO
for long running transactions before aborting them.

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB


 As  a  rule  of  thumb,  online  applications  that hold open
 transactions during user interaction  are  considered  to  be
 Broken  By  Design  (tm).   So I'd slap the programmer/design
 team with - let's use the server box since it doesn't contain
 anything useful.

We have a database system here, and not an OLTP helper app.
A database system must support all sorts of mixed usage from simple 
OLTP to OLAP. Imho the usual separation onto different servers gives more
headaches than are necessary.

Thus the above statement can imho be true for one OLTP application, but not 
for all applications on one db server.

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB


 Correct me if I am wrong, but both cases do present a problem currently 
 in 7.1.  The WAL log will not remove any WAL files for transactions that 
 are still open (even after a checkpoint occurs).  Thus if you do a bulk 
 insert of gigabyte size you will require a gigabyte sized WAL 
 directory.  Also if you have a simple OLTP transaction that the user 
 started and walked away from for his one week vacation, then no WAL log 
 files can be deleted until that user returns from his vacation and ends 
 his transaction.

I am not sure; it might be implemented that way. But there is no technical reason
to keep them beyond a checkpoint without UNDO.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB

 
 UNDO in Oracle is done by something known as a 'rollback segment'.  

You are not seriously saying that you like the rollback segments in Oracle.
They only cause trouble: 
1. configuration (for every different workload you need a different config) 
2. snapshot too old 
3. tx abort because rollback segments are full
4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)

If I read the papers correctly, Version 9 gets rid of point 1, but the rest ...

Andreas




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Mikheev, Vadim

  And, I cannot say that I would implement UNDO because of
  1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
  but because of ALL of 1., 2., 4.
 
 OK, I understand your reasoning here, but I want to make a comment.
 
 Looking at the previous features you added, like subqueries, MVCC, or
 WAL, these were major features that greatly enhanced the system's
 capabilities.
 
 Now, looking at UNDO, I just don't see it in the same league as those
 other additions.  Of course, you can work on whatever you want, but I
 was hoping to see another major feature addition for 7.2.  We know we
 badly need auto-vacuum, improved replication, and point-in-time recovery.

I don't like the auto-vacuum approach in the long term, WAL-based BAR is too easy
to do -:) (and you know that there is a man who will do it, probably),
bidirectional sync replication is good to work on, but I'm more
interested in storage/transaction management now. And I'm not sure
if I'll have enough time for another major feature in 7.2 anyway.

 It would be better to put work into one mechanism that would
 reuse all tuples.

This is what we're discussing now -:)
If community will not like UNDO then I'll probably try to implement
dead space collector which will read log files and so on. Easy to
#ifdef it in 7.2 to use in 7.3 (or so) with on-disk FSM. Also, I have
to implement logging for non-btree indices (anyway required for UNDO,
WAL-based BAR, WAL-based space reusing).

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  We could keep the share buffer lock (or add some other kind of lock)
  until the tuple is projected - after projection we need not read data
  for the fetched tuple from the shared buffer, and the time between fetching
  the tuple and projection is very short, so keeping a lock on the buffer will
  not impact concurrency significantly.
 
 Or drop the pin on the buffer to show we no longer have a pointer to it.

This is not good for seqscans which will return to that buffer anyway.

  Or we could register callback cleanup function with buffer so bufmgr
  would call it when refcnt drops to 0.
 
 Hmm ... might work.  There's no guarantee that the refcnt
 would drop to zero before the current backend exits, however.
 Perhaps set a flag in the shared buffer header, and the last guy
 to drop his pin is supposed to do the cleanup?

This is what I meant - register a pointer in the buffer header
to a cleanup function (rough sketch below).
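
Something like this (a sketch with invented names, not the actual bufmgr API):

typedef struct BufferDesc {
    int   refcount;                          /* pin count */
    void (*cleanup)(struct BufferDesc *);    /* registered by the cleaner */
} BufferDesc;

void UnpinBuffer(BufferDesc *buf)
{
    if (--buf->refcount == 0 && buf->cleanup != NULL)
    {
        void (*fn)(BufferDesc *) = buf->cleanup;

        buf->cleanup = NULL;    /* run the deferred work at most once */
        fn(buf);                /* the last unpinner does the cleanup */
    }
}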

 But then you'd be pushing VACUUM's work into productive transactions,
 which is probably not the way to go.

Not big work - I wouldn't worry about it.

  Two ways: hold the index page lock until the heap tuple is checked,
  or (rough schema) store info in shmem (just IndexTupleData.t_tid
  and a flag) that an index tuple is used by some scan, so the cleaner could
  change the stored TID (get one from the prev index tuple) and set the flag to
  help the scan restore its current position on return.
 
 Another way is to mark the index tuple gone but not forgotten, so to
 speak --- mark it dead without removing it. (We could know that we need
 to do that if we see someone else has a buffer pin on the index page.)

Register cleanup function just like with heap above.

 None of these seem real clean though.  Needs more thought.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  We could keep the share buffer lock (or add some other kind of lock)
  until the tuple is projected - after projection we need not read data
  for the fetched tuple from the shared buffer, and the time between fetching
  the tuple and projection is very short, so keeping a lock on the buffer will
  not impact concurrency significantly.
 
 Or drop the pin on the buffer to show we no longer have a pointer
 to it. I'm not sure that the time to do projection is short though
 --- what if there are arbitrary user-defined functions in the quals
 or the projection targetlist?

Well, while we are on this subject I finally should mention an issue
that has bothered me for a long time: only simple functions should be allowed
to deal with data in shared buffers directly. Simple means: no SQL
queries there. Why? One reason: we hold a shlock on the buffer while doing
the seqscan qual - what if the qual's SQL queries try to acquire an exclock
on the same buffer? Another reason - concurrency. I think that such
heavy functions should be provided with a copy of the data.

Vadim




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Tom Lane

Mikheev, Vadim [EMAIL PROTECTED] writes:
 I'm not sure that the time to do projection is short though
 --- what if there are arbitrary user-defined functions in the quals
 or the projection targetlist?

 Well, while we are on this subject I finally should mention an issue
 that has bothered me for a long time: only simple functions should be allowed
 to deal with data in shared buffers directly. Simple means: no SQL
 queries there. Why? One reason: we hold a shlock on the buffer while doing
 the seqscan qual - what if the qual's SQL queries try to acquire an exclock
 on the same buffer?

I think we're there already: AFAICT, user-specified quals and
projections are done after dropping the buffer shlock.  (Yes, I know
there's a HeapKeyTest inside heapgettup, but user quals don't get
done there.)  We do still hold a pin, but that seems OK to me.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Tom Lane

Vadim Mikheev [EMAIL PROTECTED] writes:
 It probably will not cause more IO than vacuum does right now.
 But unfortunately it will not reduce that IO.

Uh ... what?  Certainly it will reduce the total cost of vacuum,
because it won't bother to try to move tuples to fill holes.
The index cleanup method I've proposed should be substantially
more efficient than the existing code, as well.

 My point is that we'll need dynamic cleanup anyway and UNDO is
 what should be implemented for dynamic cleanup of aborted changes.

UNDO might offer some other benefits, but I doubt that it will allow
us to eliminate VACUUM completely.  To do that, you would need to
keep track of free space using exact, persistent (on-disk) bookkeeping
data structures.  The overhead of that will be very substantial: more,
I predict, than the approximate approach I proposed.

regards, tom lane




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


  Vadim, can you remind me what UNDO is used for?
 4. Split pg_log into small files with ability to remove old ones (which
do not hold statuses for any running transactions).

They are already small (16Mb). Or do you mean even smaller ?
This imposes one huge risk that is already a pain in other db's: you need
all logs of one transaction online. For a gigabyte transaction like a bulk
insert this can be very inconvenient. 
Imho there should be some limit where you can choose whether you want 
to continue without the feature (no savepoint) or are automatically aborted.

In any case, imho some thought should be put into this :-)

Another case where this is a problem is a client that starts a tx, does one little
insert or update on his private table, and then sits and waits for a day.

Both cases currently pose no problem whatsoever.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


   Vadim, can you remind me what UNDO is used for?
  4. Split pg_log into small files with ability to remove old ones (which
 do not hold statuses for any running transactions).

and I wrote:
 They are already small (16Mb). Or do you mean even smaller ?

Sorry for the little confusion of pg_log with WAL above, on my side :-(

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


 Really?! Once again: WAL records give you *physical* address of tuples
 (both heap and index ones!) to be removed and size of log to read
 records from is not comparable with size of data files.

So how about a background vacuum-like process that reads the WAL
and does the cleanup? Seems that would be great, since it then does not 
need to scan, and does not make foreground cleanup necessary.

The problem is when cleanup cannot keep up with WAL files that already 
want to be removed. I would envision a config parameter that says how many Mb
of WAL are allowed to queue up before clients are blocked (rough sketch below).
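
A minimal sketch of such a throttle (invented names and a made-up parameter;
nothing like this exists in the backend today):

#define MAX_PENDING_WAL_MB 256          /* admin-configurable limit */

extern long pending_wal_mb(void);       /* WAL written but not yet cleaned */
extern void wait_for_cleanup_daemon(void);  /* hypothetical sleep/notify */

/* Called before a client writes more WAL: block while the cleanup
 * daemon is too far behind. */
void maybe_block_client(void)
{
    while (pending_wal_mb() > MAX_PENDING_WAL_MB)
        wait_for_cleanup_daemon();
}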

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


 Would it be possible to split the WAL traffic into two sets of files,

Sure, downside is two fsyncs :-( When I first suggested physical log 
I had a separate file in mind, but that is imho only a small issue.

Of course people with more than 3 disks could benefit from a split.

Tom: If your ratio of physical pages vs WAL records is so bad, the config
should simply be changed to do fewer checkpoints (say every 20 min like a 
typical Informix setup).

Andreas




Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Tom Lane

Zeugswetter Andreas SB  [EMAIL PROTECTED] writes:
 Tom: If your ratio of physical pages vs WAL records is so bad, the config
 should simply be changed to do fewer checkpoints (say every 20 min like a 
 typical Informix setup).

I was using the default configuration.  What caused the problem was
probably not so much the standard 5-minute time-interval-driven
checkpoints, as it was the standard every-3-WAL-segments checkpoints.
Possibly we ought to increase that number?

regards, tom lane




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


 My point is that we'll need dynamic cleanup anyway and UNDO is
 what should be implemented for dynamic cleanup of aborted changes.

I do not yet understand why you want to handle aborts differently than outdated
tuples. The ratio in a well-tuned system should well favor outdated tuples.
If someone ever adds dirty reads, it is also not guaranteed
that nobody accesses the tuple you currently want to undo. So I really fail to see
the big difference.

Andreas




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  It probably will not cause more IO than vacuum does right now.
  But unfortunately it will not reduce that IO.
 
 Uh ... what?  Certainly it will reduce the total cost of vacuum,
 because it won't bother to try to move tuples to fill holes.

Oh, you're right here, but the daemon will most likely read data files
again and again with an in-memory FSM. Also, if we do partial table
scans then we'll probably re-read indices > 1 time.

 The index cleanup method I've proposed should be substantially
 more efficient than the existing code, as well.

Not in IO area.

  My point is that we'll need dynamic cleanup anyway and UNDO is
  what should be implemented for dynamic cleanup of aborted changes.
 
 UNDO might offer some other benefits, but I doubt that it will allow
 us to eliminate VACUUM completely.  To do that, you would need to

I never said that -:)

 keep track of free space using exact, persistent (on-disk) bookkeeping
 data structures.  The overhead of that will be very substantial: more,
 I predict, than the approximate approach I proposed.

I doubt that big guys use in-memory FSM. If they were able to do this...

Vadim




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


  Tom: If your ratio of physical pages vs WAL records is so bad, the config
  should simply be changed to do fewer checkpoints (say every 20 min like a 
  typical Informix setup).
 
 I was using the default configuration.  What caused the problem was
 probably not so much the standard 5-minute time-interval-driven

I am quite sure that I would increase the default to at least 15 min here.

 checkpoints, as it was the standard every-3-WAL-segments checkpoints.
 Possibly we ought to increase that number?

Here I am unfortunately not so sure with the current logic (that you can only free 
them after the checkpoint). I think the admin has to choose this. Maybe increase to 4,
but 64 Mb is quite a lot for a small installation :-(

Andreas




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

 I hope we can avoid on-disk FSM.  Seems to me that that would create
 problems both for performance (lots of extra disk I/O) and reliability
 (what happens if FSM is corrupted?  A restart won't fix it).

We can use WAL for FSM.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  Really?! Once again: WAL records give you *physical*
  address of tuples (both heap and index ones!) to be
  removed and size of log to read records from is not
  comparable with size of data files.
 
 So how about a background vacuum-like process that reads
 the WAL and does the cleanup? Seems that would be great,
 since it then does not need to scan, and does not make
 foreground cleanup necessary.
 
 The problem is when cleanup cannot keep up with WAL
 files that already want to be removed. I would envision a
 config parameter that says how many Mb of WAL are allowed to queue
 up before clients are blocked.

Yes, some daemon could read logs and gather cleanup info.
We could activate it when switching to a new log file segment,
and synchronization with the checkpointer is not a big deal. That
daemon would also archive log files for WAL-based BAR,
if archiving is ON.

But this will be useful only with on-disk FSM.

Vadim




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Tom Lane

Jan Wieck [EMAIL PROTECTED] writes:
 I think the in-shared-mem FSM could have  some  max-per-table
 limit  and  the background VACUUM just skips the entire table
 as long as nobody  reused  any  space.

I was toying with the notion of trying to use Vadim's MNMB idea
(see his description of the work he did for Perlstein last year);
that is, keep track of the lowest block number of any modified block
within each relation since the last VACUUM.  Then VACUUM would only
have to scan from there to the end.  This covers the totally-untouched-
relation case nicely, and also helps a lot for large rels that you're
mostly just adding to or perhaps updating recent additions.
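
A minimal sketch of that bookkeeping (invented names; not the FSM code):

typedef unsigned int BlockNumber;

typedef struct {
    unsigned int relid;
    BlockNumber  min_modified;  /* lowest block touched since last VACUUM */
} RelVacuumHint;

/* Called whenever a block of the relation is dirtied (insert/update/delete). */
void note_block_modified(RelVacuumHint *h, BlockNumber blk)
{
    if (blk < h->min_modified)
        h->min_modified = blk;
}

/* After VACUUM scans from min_modified to the end, reset the hint. */
void after_vacuum(RelVacuumHint *h, BlockNumber rel_nblocks)
{
    h->min_modified = rel_nblocks;
}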

The FSM could probably keep track of such info fairly easily, since
it will already be aware of which blocks it's told backends to try
to insert into.  But it would have to be told about deletes too,
which would mean more FSM access traffic and more lock contention.
Another problem (given my current view of how FSM should work) is that
rels not being used at all would not be in FSM, or would age out of it,
and so you wouldn't know that you didn't need to vacuum them.
So I'm not sure yet if it's a good idea.

regards, tom lane




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

  My point is that we'll need dynamic cleanup anyway and UNDO is
  what should be implemented for dynamic cleanup of aborted changes.
 
 I do not yet understand why you want to handle aborts differently than
 outdated tuples.

Maybe because aborted tuples have a shorter time-to-live,
and the probability of finding their pages in the buffer pool is higher.

 The ratio in a well-tuned system should well favor outdated tuples.
 If someone ever adds dirty reads, it is also not guaranteed
 that nobody accesses the tuple you currently want
 to undo. So I really fail to see the big difference.

It will not be guaranteed anyway as soon as we start removing
tuples without exclusive access to the relation.

And, I cannot say that I would implement UNDO because of
1. (cleanup) OR 2. (savepoints) OR 4. (pg_log management)
but because of ALL of 1., 2., 4.

Vadim




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

 From: Mikheev, Vadim 
 Sent: Monday, May 21, 2001 10:23 AM
 To: 'Jan Wieck'; Tom Lane
 Cc: The Hermit Hacker; 'Bruce Momjian';
 [EMAIL PROTECTED]

Strange address, Jan?

 Subject: RE: [HACKERS] Plans for solving the VACUUM problem
 
 
  I think the in-shared-mem FSM could have  some  max-per-table
  limit  and  the background VACUUM just skips the entire table
  as long as nobody  reused  any  space.  Also  it  might  only
  compact pages that lead to 25 or more percent of freespace in
  the first place. That makes it more likely  that  if  someone
  looks  for  a place to store a tuple that it'll fit into that
  block (remember that the toaster tries to  keep  main  tuples
  below BLKSZ/4).
 
 This should be a configurable parameter like PCTFREE (or something
 like that) in Oracle: consider a page for insertion only if it's
 at least PCTFREE percent empty.
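
A one-line sketch of that test (the PCTFREE name and the 25 are illustrative,
borrowed from Oracle; not an existing PostgreSQL parameter):

#include <stdbool.h>
#include <stddef.h>

#define PCTFREE 25   /* consider a page only while >= 25% of it is free */

bool page_accepts_inserts(size_t free_bytes, size_t page_size)
{
    return free_bytes * 100 >= (size_t) PCTFREE * page_size;
}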
 
 Vadim
 




RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

 Correct me if I am wrong, but both cases do present a problem
 currently in 7.1.  The WAL log will not remove any WAL files
 for transactions that are still open (even after a checkpoint
 occurs). Thus if you do a bulk insert of gigabyte size you will
 require a gigabyte sized WAL directory. Also if you have a simple
 OLTP transaction that the user started and walked away from for
 his one week vacation, then no WAL log files can be deleted until
 that user returns from his vacation and ends his transaction.

Todo:

1. Compact log files after checkpoint (save records of uncommitted
   transactions and remove/archive others) - rough sketch below.
2. Abort long running transactions.
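
A minimal sketch of item 1 (hypothetical record layout and helpers,
not the actual WAL format):

typedef struct {
    unsigned int xid;
    /* ... log record payload ... */
} WalRecord;

extern int xid_still_open(unsigned int xid);  /* hypothetical check */

/* After a checkpoint, rewrite a segment keeping only records of
 * still-open transactions (needed for possible UNDO); everything
 * else can be removed or archived. */
int compact_segment(const WalRecord *in, int n, WalRecord *out)
{
    int kept = 0;

    for (int i = 0; i < n; i++)
        if (xid_still_open(in[i].xid))
            out[kept++] = in[i];
    return kept;
}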

Vadim




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 Hm.  On the other hand, relying on WAL for undo means you cannot drop
 old WAL segments that contain records for any open transaction.  We've
 already seen several complaints that the WAL logs grow unmanageably huge
 when there is a long-running transaction, and I think we'll see a lot
 more.
 
 It would be nicer if we could drop WAL records after a checkpoint or two,
 even in the presence of long-running transactions.  We could do that if
 we were only relying on them for crash recovery and not for UNDO.

As you understand, this is an old, well-known problem in database practice,
described in books. Two ways - either abort too-long-running transactions
or (/and) compact old log segments: fetch and save (to use for undo) the
records of long-running transactions and remove the other records. Neither
way is perfect, but nothing is perfect at all -:)

 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
 make lightweight VACUUM work well.  (I definitely don't like the idea

Sorry, but I'm going to consider background vacuum as a temporary solution
only. As I've already pointed out, the original PG authors finally became
disillusioned with the same approach. What is good about using UNDO for 1.
is the fact that WAL records give you *direct* physical access to the changes
which should be rolled back.

 that after a very long transaction fails and aborts, I'd have to wait
 another very long time for UNDO to do its thing before I could get on
 with my work.  Would much rather have the space reclamation happen in
 background.)

Understandable, but why should other transactions read dirty data again
and again waiting for the background vacuum? I think aborted transactions
should take some responsibility for the mess made by them -:)
And keeping in mind 2., very long transactions could be continued -:)

 2. SAVEPOINTs would be awfully nice to have, I agree.
 
 3. Reusing xact IDs would be nice, but there's an answer with a lot less
 impact on the system: go to 8-byte xact IDs.  Having to shut down the
 postmaster when you approach the 4Gb transaction mark isn't going to
 impress people who want a 24x7 commitment, anyway.

+8 bytes in the tuple header is not such a tiny thing.

 4. Recycling pg_log would be nice too, but we've already discussed other
 hacks that might allow pg_log to be kept finite without depending on
 UNDO (or requiring postmaster restarts, IIRC).

We did... and didn't get agreement.

 I'm sort of thinking that undoing back to a savepoint is the only real
 usefulness of WAL-based UNDO. Is it practical to preserve the WAL log
 just back to the last savepoint in each xact, not the whole xact?

No, it's not. It's not possible in overwriting systems at all - all
transaction records are required.

 Another thought: do we need WAL UNDO at all to implement savepoints?
 Is there some way we could do them like nested transactions, wherein
 each savepoint-to-savepoint segment is given its own transaction number?
 Committing multiple xact IDs at once might be a little tricky, but it
 seems like a narrow, soluble problem.

Implicit savepoints wouldn't be possible - this is a very convenient
feature I've found in Oracle.
And additional code in tqual.c wouldn't be a good addition.

 Implementing UNDO without creating lots of performance issues looks
 a lot harder.

What *performance* issues?!
The only issue is additional disk requirements.

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Tom Lane

Vadim Mikheev [EMAIL PROTECTED] writes:
 1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
 make lightweight VACUUM work well.

  Sorry, but I'm going to consider background vacuum as a temporary solution
  only. As I've already pointed out, the original PG authors finally became
  disillusioned with the same approach.

How could they become disillusioned with it, when they never tried it?
I know of no evidence that any version of PG has had backgroundable
(non-blocking-to-other-transactions) VACUUM, still less within-relation
space recycling.  They may have become disillusioned with the form of
VACUUM that they actually had (ie, the same one we've inherited) --- but
please don't call that the same approach I'm proposing.

Certainly, doing VACUUM this way is an experiment that may fail, or may
require further work before it really works well.  But I'd appreciate it
if you wouldn't prejudge the results of the experiment.

 Would much rather have the space reclamation happen in
 background.)

  Understandable, but why should other transactions read dirty data again
  and again waiting for the background vacuum? I think aborted transactions
  should take some responsibility for the mess made by them -:)

They might read it again and again before the failed xact gets around to
removing the data, too.  You cannot rely on UNDO for correctness; at
most it can be a speed/space optimization.  I see no reason to assume
that it's a more effective optimization than a background vacuum
process.

 3. Reusing xact IDs would be nice, but there's an answer with a lot less
 impact on the system: go to 8-byte xact IDs.

 +8 bytes in the tuple header is not such a tiny thing.

Agreed, but the people who need 8-byte IDs are not running small
installations.  I think they'd sooner pay a little more in disk space
than risk costs in performance or reliability.

 Another thought: do we need WAL UNDO at all to implement savepoints?
 Is there some way we could do them like nested transactions, wherein
 each savepoint-to-savepoint segment is given its own transaction number?

 Implicit savepoints wouldn't be possible - this is a very convenient
 feature I've found in Oracle.

Why not?  Seems to me that establishing implicit savepoints is just a
user-interface issue; you can do it, or not do it, regardless of the
underlying mechanism.

 Implementing UNDO without creating lots of performance issues looks
 a lot harder.

 What *performance* issues?!
 The only issue is additional disk requirements.

Not so.  UNDO does failed-transaction cleanup work in the interactive
backends, where it necessarily delays clients who might otherwise be
issuing their next command.  A VACUUM-based approach does the cleanup
work in the background.  Same work, more or less, but it's not in the
clients' critical path.

BTW, UNDO for failed transactions alone will not eliminate the need for
VACUUM.  Will you also make successful transactions go back and
physically remove the tuples they deleted?

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

  1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
  make lightweight VACUUM work well.
 
  Sorry, but I'm going to consider background vacuum as a temporary solution
  only. As I've already pointed out, the original PG authors finally became
  disillusioned with the same approach.
 
 How could they become disillusioned with it, when they never tried it?
 I know of no evidence that any version of PG has had backgroundable
 (non-blocking-to-other-transactions) VACUUM, still less within-relation
 space recycling.  They may have become disillusioned with the form of
 VACUUM that they actually had (ie, the same one we've inherited) --- but
 please don't call that the same approach I'm proposing.

Pre-Postgres'95 (original) versions had a vacuum daemon running in the
background. I don't know if that vacuum shrank relations or not
(there was no shrinking in the '95 version); I know that the daemon had to
do some extra work in moving old tuples to archival storage, but
anyway, as you can read in the old papers, in the case of consistent heavy
load the daemon was not able to clean up storage fast enough. And the
reason is obvious - no matter how optimized your daemon will be
(in regard to blocking other transactions etc), it will have to
perform a huge amount of IO just to find space available for reclaiming.

 Certainly, doing VACUUM this way is an experiment that may fail, or may
 require further work before it really works well. But I'd appreciate it
 if you wouldn't prejudge the results of the experiment.

Why not, Tom? Why shouldn't I state my opinion?
Last summer your comment about WAL, my experiment at that time, was that
it would save just a few fsyncs. It was your right to make a prejudgment;
what's wrong with my rights? And you appealed to old papers as well, BTW.

  Understandable, but why should other transactions read dirty data again
  and again while waiting for background vacuum? I think aborted transactions
  should take some responsibility for the mess made by them -:)
 
 They might read it again and again before the failed xact gets around to
 removing the data, too.  You cannot rely on UNDO for correctness; at
 most it can be a speed/space optimization. I see no reason to assume
 that it's a more effective optimization than a background vacuum
 process.

Really?! Once again: WAL records give you the *physical* address of tuples
(both heap and index ones!) to be removed, and the size of the log to read
records from is not comparable with the size of the data files.

  Another thought: do we need WAL UNDO at all to implement savepoints?
  Is there some way we could do them like nested transactions, wherein
  each savepoint-to-savepoint segment is given its own transaction number?
 
  Implicit savepoints wouldn't be possible - this is a very convenient
  feature I've found in Oracle.
 
 Why not?  Seems to me that establishing implicit savepoints is just a
 user-interface issue; you can do it, or not do it, regardless of the
 underlying mechanism.

Implicit savepoints are set by the server automatically before each
query execution - you wouldn't use transaction IDs for this.

  Implementing UNDO without creating lots of performance issues looks
  a lot harder.
 
  What *performance* issues?!
  The only issue is additional disk requirements.
 
 Not so. UNDO does failed-transaction cleanup work in the interactive
 backends, where it necessarily delays clients who might otherwise be
 issuing their next command.  A VACUUM-based approach does the cleanup
 work in the background. Same work, more or less, but it's not in the
 clients' critical path.

Not the same work but much more, and in the critical paths of all clients.
And - is the overall performance of Oracle or Informix worse than in PG?
Seems delays in clients for rollback don't affect performance so much.
But dirty storage does.

 BTW, UNDO for failed transactions alone will not eliminate the need for
 VACUUM.  Will you also make successful transactions go back and
 physically remove the tuples they deleted?

They can't do this, as you know pretty well. But using WAL to get TIDs to
be deleted is worth considering, no?

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 Were you going to use WAL to get free space from old copies too?

An approach worth considering.

 Vadim, I think I am missing something.  You mentioned UNDO would be used
 for these cases and I don't understand the purpose of adding what would
 seem to be a pretty complex capability:

Yeh, we already won the title of most advanced among simple databases -:)
Yes, looking in a list of IDs assigned to a single transaction in tqual.c is much
easier to do than UNDO. Just as a couple of fsyncs is easier than WAL.

  1. Reclaim space allocated by aborted transactions.
 
 Is there really a lot to be saved here vs. old tuples of committed
 transactions?

Are you able to protect COPY FROM from abort/crash?

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Tom Lane

Vadim Mikheev [EMAIL PROTECTED] writes:
 Really?! Once again: WAL records give you the *physical* address of tuples
 (both heap and index ones!) to be removed, and the size of the log to read
 records from is not comparable with the size of the data files.

You sure?  With our current approach of dumping data pages into the WAL
on first change since checkpoint (and doing so again after each
checkpoint) it's not too difficult to devise scenarios where the WAL log
is *larger* than the affected datafiles ... and can't be truncated until
someone commits.

The copied-data-page traffic is the worst problem with our current
WAL implementation.  I did some measurements last week on VACUUM of a
test table (the accounts table from a pg_bench -s 10 setup, which
contains 1000000 rows; I updated 20000 rows and then vacuumed).  This
generated about 34400 8k blocks of WAL traffic, of which about 33300
represented copied pages and the other 1100 blocks were actual WAL
entries.  That's a pretty massive I/O overhead, considering the table
itself was under 20000 8k blocks.  It was also interesting to note that
a large fraction of the CPU time was spent calculating CRCs on the WAL
data.

Would it be possible to split the WAL traffic into two sets of files,
one for WAL log records proper and one for copied pages?  Seems like
we could recycle the pages after each checkpoint rather than hanging
onto them until the associated transactions commit.

 Why not?  Seems to me that establishing implicit savepoints is just a
 user-interface issue; you can do it, or not do it, regardless of the
 underlying mechanism.

 Implicit savepoints are set by the server automatically before each
 query execution - you wouldn't use transaction IDs for this.

If the user asked you to, I don't see why not.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread The Hermit Hacker

On Sun, 20 May 2001, Vadim Mikheev wrote:

   1. Space reclamation via UNDO doesn't excite me a whole lot, if we can
   make lightweight VACUUM work well.
 
   Sorry, but I'm going to consider background vacuum as a temporary solution
   only. As I've already pointed out, the original PG authors finally became
   disillusioned with the same approach.
 
  How could they become disillusioned with it, when they never tried it?
  I know of no evidence that any version of PG has had backgroundable
  (non-blocking-to-other-transactions) VACUUM, still less within-relation
  space recycling.  They may have become disillusioned with the form of
  VACUUM that they actually had (ie, the same one we've inherited) --- but
  please don't call that the same approach I'm proposing.

 Pre-Postgres'95 (original) versions had a vacuum daemon running in the
 background. I don't know if that vacuum shrank relations or not
 (there was no shrinking in the '95 version); I know that the daemon had to
 do some extra work in moving old tuples to archival storage, but
 anyway, as you can read in the old papers, in the case of consistent heavy
 load the daemon was not able to clean up storage fast enough. And the
 reason is obvious - no matter how optimized your daemon will be
 (in regard to blocking other transactions etc), it will have to
 perform a huge amount of IO just to find space available for reclaiming.

  Certainly, doing VACUUM this way is an experiment that may fail, or may
  require further work before it really works well. But I'd appreciate it
  if you wouldn't prejudge the results of the experiment.

 Why not, Tom? Why shouldn't I state my opinion?
 Last summer your comment about WAL, my experiment at that time, was that
 it would save just a few fsyncs. It was your right to make a prejudgment;
 what's wrong with my rights? And you appealed to old papers as well, BTW.

If it's an experiment, shouldn't it be done outside of the main source
tree, with adequate testing in a high-load situation, with a patch
released to the community for further testing/comments, before it is added
to the source tree?  From reading Vadim's comment above (re:
pre-Postgres95), this daemonized approach would cause a high I/O load on
the server in a situation where there are *a lot* of UPDATE/DELETEs
happening to the database, which should be easily recreatable, no?  Or,
Vadim, am I misunderstanding?






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Tom Lane

The Hermit Hacker [EMAIL PROTECTED] writes:
 If its an experiment, shouldn't it be done outside of the main source
 tree, with adequate testing in a high load situation, with a patch
 released to the community for further testing/comments, before it is added
 to the source tree?

Mebbe we should've handled WAL that way too ;-)

Seriously, I don't think that my proposed changes need be treated with
quite that much suspicion.  The only part that is really intrusive is
the shared-memory free-heap-space-management change.  But AFAICT that
will be a necessary component of *any* approach to getting rid of
VACUUM.  We've been arguing here, in essence, about whether a background
or on-line approach to finding free space will be more useful; but that
still leaves you with the question of what you do with the free space
after you've found it.  Without some kind of shared free space map,
there's not anything you can do except have the process that found the
space do tuple moving and file truncation --- ie, VACUUM.  So even if
I'm quite wrong about the effectiveness of a background VACUUM, the FSM
code will still be needed: an UNDO-style approach is also going to need
an FSM to do anything with the free space it finds.  It's equally clear
that the index AMs have to support index tuple deletion without
exclusive lock, or we'll still have blocking problems during free-space
cleanup, no matter what drives that cleanup.  The only part of what
I've proposed that might end up getting relegated to the scrap heap is
the lazy vacuum command itself, which will be a self-contained and
relatively small module (smaller than the present commands/vacuum.c,
for sure).

Besides which, Vadim has already said that he won't have time to do
anything about space reclamation before 7.2.  So even if background
vacuum does end up getting superseded by something better, we're going
to need it for a release or two ...

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

  Really?! Once again: WAL records give you the *physical* address of tuples
  (both heap and index ones!) to be removed, and the size of the log to read
  records from is not comparable with the size of the data files.
 
 You sure?  With our current approach of dumping data pages into the WAL
 on first change since checkpoint (and doing so again after each
 checkpoint) it's not too difficult to devise scenarios where the WAL log
 is *larger* than the affected datafiles ... and can't be truncated until
 someone commits.

Yes, but note my words "size of log to read records from" - each log record
has a pointer to the previous record made by the same transaction: rollback
need not read the entire log file to get all records of a specific transaction.

  Why not?  Seems to me that establishing implicit savepoints is just a
  user-interface issue; you can do it, or not do it, regardless of the
  underlying mechanism.
 
  Implicit savepoints are set by the server automatically before each
  query execution - you wouldn't use transaction IDs for this.
 
 If the user asked you to, I don't see why not.

An example of one use of implicit savepoints: skipping duplicate key insertion.
Using transaction IDs when someone wants to insert a few thousand records?

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 If it's an experiment, shouldn't it be done outside of the main source
 tree, with adequate testing in a high-load situation, with a patch
 released to the community for further testing/comments, before it is added
 to the source tree?  From reading Vadim's comment above (re:
 pre-Postgres95), this daemonized approach would cause a high I/O load on
 the server in a situation where there are *a lot* of UPDATE/DELETEs
 happening to the database, which should be easily recreatable, no?  Or,
 Vadim, am I misunderstanding?

It probably will not cause more IO than vacuum does right now.
But unfortunately it will not reduce that IO. Cleanup work will be spread
over time and users will not experience long lockouts, but the average impact
on overall system throughput will be the same (or maybe higher).
My point is that we'll need dynamic cleanup anyway, and UNDO is
what should be implemented for dynamic cleanup of aborted changes.
Plus UNDO gives us a natural implementation of savepoints and some
abilities in transaction ID management, which we may use or not
(though, 4. - pg_log size management - is a really good thing).

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Vadim Mikheev

 Seriously, I don't think that my proposed changes need be treated with
 quite that much suspicion.  The only part that is really intrusive is

Agreed. I fight for UNDO, not against background vacuum -:)

 the shared-memory free-heap-space-management change.  But AFAICT that
 will be a necessary component of *any* approach to getting rid of
 VACUUM.  We've been arguing here, in essence, about whether a background
 or on-line approach to finding free space will be more useful; but that
 still leaves you with the question of what you do with the free space
 after you've found it.  Without some kind of shared free space map,
 there's not anything you can do except have the process that found the
 space do tuple moving and file truncation --- ie, VACUUM.  So even if
 I'm quite wrong about the effectiveness of a background VACUUM, the FSM
 code will still be needed: an UNDO-style approach is also going to need
 an FSM to do anything with the free space it finds.  It's equally clear

Unfortunately, I think that we'll need an on-disk FSM, and that FSM is
actually the most complex thing to do in the space reclamation project.

 Besides which, Vadim has already said that he won't have time to do
 anything about space reclamation before 7.2.  So even if background
 vacuum does end up getting superseded by something better, we're going
 to need it for a release or two ...

Yes.

Vadim






Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-20 Thread Tom Lane

Vadim Mikheev [EMAIL PROTECTED] writes:
 Unfortunately, I think that we'll need an on-disk FSM, and that FSM is
 actually the most complex thing to do in the space reclamation project.

I hope we can avoid on-disk FSM.  Seems to me that that would create
problems both for performance (lots of extra disk I/O) and reliability
(what happens if FSM is corrupted?  A restart won't fix it).

But, if we do need it, most of the work needed to install FSM APIs
should carry over.  So I still don't see an objection to doing
in-memory FSM as a first step.


BTW, I was digging through the old Postgres papers this afternoon,
to refresh my memory about what they actually said about VACUUM.
I was interested to discover that at one time the tuple-insertion
algorithm went as follows:
  1. Pick a page at random in the relation, read it in, and see if it
 has enough free space.  Repeat up to three times.
  2. If #1 fails to find space, append tuple at end.
When they got around to doing some performance measurement, they
discovered that step #1 was a serious loser, and dropped it in favor
of pure #2 (which is what we still have today).  Food for thought.
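
As a rough illustration of that old heuristic (a sketch only: the toy
free-space array and helper shape are invented, not the historical code):

    /* Old Berkeley insertion heuristic as described above: probe up to
       three random pages for room, else append a new page at the end. */
    #include <stdlib.h>

    #define TRIES 3

    extern int npages;          /* pages currently in the relation     */
    extern int freespace[];     /* toy stand-in: free bytes per page   */

    static int
    choose_insertion_page(int tuplen)
    {
        for (int t = 0; t < TRIES; t++)
        {
            int blk = rand() % npages;      /* step 1: random probe    */
            if (freespace[blk] >= tuplen)
                return blk;
        }
        return npages++;                    /* step 2: append at end   */
    }

Step 1 is the part that measured out as a loser: each probe costs a random
page read, paid on every insertion.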

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-19 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 In fact, multi-query transactions are just a special case of
 subtransactions, where all previous subtransactions are
 committed/visible.  We could use the same pg_log-style memory area for
 multi-query transactions, eliminating the command counter and saving 8
 bytes overhead per tuple.

Interesting thought, but command IDs don't act the same as transactions;
in particular, visibility of one scan to another doesn't necessarily
depend on whether the scan has finished.

Possibly that could be taken into account by having different rules for
"do we think it's committed" in the local pg_log than in the global one.

Also, this distinction would propagate out of the xact status code;
for example, it wouldn't do for heapam to set the "known committed"
bit on a tuple just because it's from a previous subtransaction of the
current xact.  Right now that works because heapam knows the difference
between xacts and commands; it would still have to know the difference.

A much more significant objection is that such a design would eat xact
IDs at a tremendous rate, to no purpose.  CommandCounterIncrement is a
cheap operation now, and we do it with abandon.  It would not be cheap
if it implied allocating a new xact ID that would eventually need to be
marked committed.  I don't mind allocating a new xact ID for each
explicitly-created savepoint, but a new ID per CommandCounterIncrement
is a different story.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Philip Warner [EMAIL PROTECTED] writes:
 At 19:05 17/05/01 -0400, Tom Lane wrote:
 1. Forget moving tuples from one page to another.

 Could this be done opportunistically, meaning it builds up a list of
 candidates to move (perhaps based on emptiness of page), then moves a
 subset of these in each pass?

Well, if we move tuples at all then we have a considerably different
animal: to move tuples across pages you must be a transaction so that
you can have an atomic commit for both pages, and that brings back the
issue of how long the transaction runs for and how large its WAL trail
will grow before it can be dropped.  Yeah, you could move a limited
number of tuples, commit, and start again ... but it's not so
lightweight anymore.

Perhaps we will eventually end up with three strengths of VACUUM:
the existing heavy-duty form, the lazy form that isn't transactional,
and an intermediate form that is willing to move tuples in simpler
cases (none of that tuple-chain-moving stuff please ;-)).  But I'm
not buying into producing the intermediate form in this go-round.
Let's build the lazy form first and get some experience with it
before we decide if we need yet another kind of VACUUM.

 To do that, I propose a free space map (FSM) kept in shared memory, which
 will tell backends which pages of a relation have free space.  Only if the
 FSM shows no free space available will the relation be extended to insert
 a new or updated tuple.

 I assume that now is not a good time to bring up memory-mapped files? ;-}

Don't see the relevance exactly ...

 Were you planning on just a free byte count, or something smaller? Dec/RDB
 uses a nasty system of DBA-defined thresholds for each storage area: 4 bits,
 where 0=empty, and 1, 2 & 3 indicate above/below thresholds (3 is also
 considered 'full'). The thresholds are usually set based on average record
 sizes. In this day & age, I suspect a 1 byte percentage, or 2 byte count is
 OK unless space is really at a premium.

I had toyed with two different representations of the FSM:

1. Bitmap: one bit per page in the relation, set if there's an
interesting amount of free space in the page (exact threshold ???).
DEC's approach seems to be a generalization of this.

2. Page list: page number and number of free bytes.  This is six bytes
per page represented; you could maybe compress it to 5 but I'm not sure
there's much point.

I went with #2 mainly because it adapts easily to being forced into a
limited amount of space (just forget the pages with least free space)
which is critical for a table to be kept in shared memory.  A bitmap
would be less forgiving.  #2 also gives you a better chance of going
to a page that actually has enough free space for your tuple, though
you'd still need to be prepared to find out that it no longer has
enough once you get to it.  (Whereupon, you go back to the FSM, fix
the outdated entry, and keep searching.)
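
To illustrate the fix-the-entry-and-keep-searching behavior just described,
here is a toy sketch (the structures and the actual_free callback are
invented for the example):

    typedef struct { int blkno; int freebytes; } FsmEntry;

    /* Find a page with at least 'need' free bytes, repairing stale FSM
       entries as they are discovered; -1 means the caller must extend
       the relation. */
    static int
    find_page_with_space(FsmEntry *fsm, int n, int need,
                         int (*actual_free)(int blkno))
    {
        for (int i = 0; i < n; i++)
        {
            if (fsm[i].freebytes < need)
                continue;
            int really = actual_free(fsm[i].blkno); /* read the page: a
                                                       wasted ReadBuffer
                                                       cycle at worst   */
            if (really >= need)
                return fsm[i].blkno;
            fsm[i].freebytes = really;  /* fix the outdated entry      */
        }
        return -1;
    }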

 While on the subject of record keeping, it would be great if it was coded
 to collect statistics about its own operation for Jan's stats package.

Seems like a good idea, but I've seen no details yet about that package...

 This seems to have the potential to create as many false FSM page entries
 as there are backends. Is it really that expensive to lock the FSM table
 entry, subtract a number, then unlock it?

Yes, if you are contending against N other backends to get that lock.
Remember the *whole* point of this design is to avoid locking as much
as possible.  Too many trips to the FSM could throw away the performance
advantage.

 Is it possible/worth adding a 'blocking notification' to the lock manager?
 Then VACUUM could choose to terminate/restart when someone wants to do a
 schema change. This is really only relevant if the VACUUM will be prolonged.

Seems messier than it's worth ... the VACUUM might not be the only thing
holding off your schema update anyway, and regular transactions aren't
likely to pay any attention.

regards, tom lane




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Mikheev, Vadim

 I have been thinking about the problem of VACUUM and how we 
 might fix it for 7.2.  Vadim has suggested that we should
 attack this by implementing an overwriting storage manager
 and transaction UNDO, but I'm not totally comfortable with
 that approach: it seems to me that it's an awfully large
 change in the way Postgres works. 

I'm not sure if we should implement an overwriting smgr at all.
I was/am going to solve the space reusing problem with the non-overwriting
one, though I'm sure that we have to reimplement it (> 1 table
per data file, stored-on-disk FSM etc).

 Second: if VACUUM can run in the background, then there's no
 reason not to run it fairly frequently. In fact, it could become
 an automatically scheduled activity like CHECKPOINT is now,
 or perhaps even a continuously running daemon (which was the
 original conception of it at Berkeley, BTW).

And the original authors concluded that the daemon was very slow in
reclaiming dead space, BTW.

 3. Lazy VACUUM processes a table in five stages:
A. Scan relation looking for dead tuples;...
B. Remove index entries for the dead tuples...
C. Physically delete dead tuples and compact free space...
D. Truncate any completely-empty pages at relation's end.  
E. Create/update FSM entry for the table.
...
 If a tuple is dead, we care not whether its index entries are still
 around or not; so there's no risk to logical consistency.

What does this sentence mean? We canNOT remove a dead heap tuple until
we know that there are no index tuples referencing it, and your A,B,C
reflect this, so ..?

 Another place where lazy VACUUM may be unable to do its job completely
 is in compaction of space on individual disk pages.  It can physically
 move tuples to perform compaction only if there are not currently any
 other backends with pointers into that page (which can be tested by
 looking to see if the buffer reference count is one).  Again, we punt
 and leave the space to be compacted next time if we can't do it right
 away.

We could keep the shared buffer lock (or add some other kind of lock)
until the tuple is projected - after projection we need not read data
for the fetched tuple from the shared buffer, and the time between fetching
the tuple and projection is very short, so keeping the lock on the buffer will
not impact concurrency significantly.

Or we could register a callback cleanup function with the buffer so bufmgr
would call it when the refcnt drops to 0.

 Presently, VACUUM deletes index tuples by doing a standard index
 scan and checking each returned index tuple to see if it points
 at any of the tuples to be deleted. If so, the index AM is called
 back to delete the tested index tuple. This is horribly inefficient:
...
 This is mainly a problem of a poorly chosen API. The index AMs
 should offer a bulk delete call, which is passed a sorted array
 of main-table TIDs. The loop over the index tuples should happen
 internally to the index AM.

I agree with others who think that the main problem of index cleanup
is reading all index data pages to remove some index tuples. You said
yourself about partial heap scanning - so for each scanned part of the table
you'll have to read all index pages again and again - a very good way to
trash the buffer pool with big indices.

Well, probably it's ok for a first implementation and you'll win some CPU
with bulk delete - I'm not sure how much, though, and there is a more
significant issue with index cleanup if the table is not locked exclusively:
a concurrent index scan returns a tuple (and unlocks the index page), heap_fetch
reads the table row and finds that it's dead; now the index scan *must* find
the current index tuple to continue, but background vacuum could already have
removed that index tuple => elog(FATAL, "_bt_restscan: my bits moved...");

Two ways: hold the index page lock until the heap tuple is checked, or (rough
schema)
store info in shmem (just IndexTupleData.t_tid and a flag) that an index tuple
is used by some scan, so the cleaner could change the stored TID (get one from the prev
index tuple) and set the flag to help the scan restore its current position on
return.
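
In outline, the first way might look like this (names are illustrative,
not the real btree code):

    /* Hold the read-lock on the index page across the heap check, so a
       concurrent cleanup cannot remove the index tuple under the scan. */
    lock_page(index_page, SHARE_LOCK);
    itup = next_matching_index_tuple(scan);
    heap_tuple = fetch_heap_tuple(heap_rel, &itup->t_tid);
    alive = tuple_satisfies(heap_tuple, snapshot);
    unlock_page(index_page);    /* only now may the cleaner touch it */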

I'm particularly interested in discussing this issue because it must be
resolved for UNDO, and the chosen way will affect to what extent we'll be able
to implement dirty reads (the first way doesn't allow implementing them in full
- ie selects with joins - but is good enough to resolve the RI constraints
concurrency issue).

 There you have it.  If people like this, I'm prepared to commit to
 making it happen for 7.2.  Comments, objections, better ideas?

Well, my current TODO looks like this (ORDER BY PRIORITY DESC):

1. UNDO;
2. New SMGR;
3. Space reusing.

and I cannot commit to anything about 3 at this point. So, why not refine
vacuum if you want it. I, personally, was never able to convince myself
to spend time on this.

Vadim




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Mikheev, Vadim [EMAIL PROTECTED] writes:
 If a tuple is dead, we care not whether its index entries are still
 around or not; so there's no risk to logical consistency.

 What does this sentence mean? We canNOT remove a dead heap tuple until
 we know that there are no index tuples referencing it, and your A,B,C
 reflect this, so ..?

Sorry if it wasn't clear.  I meant that if the vacuum process fails
after removing an index tuple but before removing the (dead) heap tuple
it points to, there's no need to try to undo.  That state is OK, and
when we next get a chance to vacuum we'll still be able to finish
removing the heap tuple.

 Another place where lazy VACUUM may be unable to do its job completely
 is in compaction of space on individual disk pages.  It can physically
 move tuples to perform compaction only if there are not currently any
 other backends with pointers into that page (which can be tested by
 looking to see if the buffer reference count is one).  Again, we punt
 and leave the space to be compacted next time if we can't do it right
 away.

 We could keep the shared buffer lock (or add some other kind of lock)
 until the tuple is projected - after projection we need not read data
 for the fetched tuple from the shared buffer, and the time between fetching
 the tuple and projection is very short, so keeping the lock on the buffer will
 not impact concurrency significantly.

Or drop the pin on the buffer to show we no longer have a pointer to it.
I'm not sure that the time to do projection is short though --- what
if there are arbitrary user-defined functions in the quals or the
projection targetlist?

 Or we could register a callback cleanup function with the buffer so bufmgr
 would call it when the refcnt drops to 0.

Hmm ... might work.  There's no guarantee that the refcnt would drop to
zero before the current backend exits, however.  Perhaps set a flag in
the shared buffer header, and the last guy to drop his pin is supposed
to do the cleanup?  But then you'd be pushing VACUUM's work into
productive transactions, which is probably not the way to go.

 This is mainly a problem of a poorly chosen API. The index AMs
 should offer a bulk delete call, which is passed a sorted array
 of main-table TIDs. The loop over the index tuples should happen
 internally to the index AM.

 I agree with others who think that the main problem of index cleanup
 is reading all index data pages to remove some index tuples.

For very small numbers of tuples that might be true.  But I'm not
convinced it's worth worrying about.  If there aren't many tuples to
be freed, perhaps VACUUM shouldn't do anything at all.

 Well, probably it's ok for a first implementation and you'll win some CPU
 with bulk delete - I'm not sure how much, though, and there is a more
 significant issue with index cleanup if the table is not locked exclusively:
 a concurrent index scan returns a tuple (and unlocks the index page), heap_fetch
 reads the table row and finds that it's dead; now the index scan *must* find
 the current index tuple to continue, but background vacuum could already have
 removed that index tuple => elog(FATAL, "_bt_restscan: my bits moved...");

Hm.  Good point ...

 Two ways: hold the index page lock until the heap tuple is checked, or (rough
 schema)
 store info in shmem (just IndexTupleData.t_tid and a flag) that an index tuple
 is used by some scan, so the cleaner could change the stored TID (get one from the prev
 index tuple) and set the flag to help the scan restore its current position on
 return.

Another way is to mark the index tuple "gone but not forgotten", so to
speak --- mark it dead without removing it.  (We could know that we need
to do that if we see someone else has a buffer pin on the index page.)
In this state, the index scan coming back to work would still be allowed
to find the index tuple, but no other index scan would stop on the
tuple.  Later passes of vacuum would eventually remove the index tuple,
whenever vacuum happened to pass through at an instant where no one has
a pin on that index page.

None of these seem real clean though.  Needs more thought.
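
For concreteness, the gone-but-not-forgotten variant might look roughly
like this inside the index-cleanup loop (every name here is invented for
the sketch):

    /* If another backend holds a pin on this index page, don't delete
       the tuple; just flag it.  The scan returning to the page can still
       find it, other scans skip it, and a later vacuum pass removes it
       once the page is unpinned. */
    if (page_pin_count(ipage) > 1)
        set_index_tuple_flag(ipage, offnum, TUPLE_DEAD);
    else
        remove_index_tuple(ipage, offnum);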

 Well, my current TODO looks like this (ORDER BY PRIORITY DESC):

 1. UNDO;
 2. New SMGR;
 3. Space reusing.

 and I cannot commit to anything about 3 at this point. So, why not refine
 vacuum if you want it. I, personally, was never able to convince myself
 to spend time on this.

Okay, good.  I was worried that this idea would conflict with what you
were doing, but it seems it won't.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Nathan Myers

On Fri, May 18, 2001 at 06:10:10PM -0700, Mikheev, Vadim wrote:
  Vadim, can you remind me what UNDO is used for?
 
 Ok, last reminder -:))
 
 On transaction abort, read WAL records and undo (rollback)
 changes made in storage. Would allow:
 
 1. Reclaim space allocated by aborted transactions.
 2. Implement SAVEPOINTs.
Just to remind -:) - in the event of an error discovered by the server
- duplicate key, deadlock, command mistyping, etc. - the transaction
will be rolled back to the nearest implicit savepoint set
just before query execution; or the transaction can be aborted by a
ROLLBACK TO savepoint_name command to some explicit savepoint
set by the user. A transaction rolled back to a savepoint may be continued.
 3. Reuse transaction IDs on postmaster restart.
 4. Split pg_log into small files with ability to remove old ones (which
do not hold statuses for any running transactions).

I missed the original discussions; apologies if this has already been
beaten into the ground.  But... mightn't sub-transactions be a 
better-structured way to expose this service?

Nathan Myers
[EMAIL PROTECTED]




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Zeugswetter Andreas SB


   Isn't the current implementation bulk delete?
 
 No, the index AM is called separately for each index tuple to be
 deleted; more to the point, the search for deletable index tuples
 should be moved inside the index AM for performance reasons.

Wouldn't a sequential scan on the heap table be the fastest way to find
keys that need to be deleted?

foreach tuple in heap that can be deleted do:
    foreach index:
        call the current index delete with the constructed key and xtid

The advantage would be that the current API would be sufficient and
it should be faster. The problem would be to create a correct key from the heap
tuple that you can pass to the index delete function.

Andreas




Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Zeugswetter Andreas SB  [EMAIL PROTECTED] writes:
 foreach tuple in heap that can be deleted do:
   foreach index
   call the current index delete with constructed key and xtid

See discussion with Hiroshi.  This is much more complex than TID-based
delete and would be faster only for small numbers of tuples.  (Very
small numbers of tuples, is my gut feeling, though there's no way to
prove that without implementations of both in hand.)

A particular point worth making is that in the common case where you've
updated the same row N times (without changing its index key), the above
approach has O(N^2) runtime.  The indexscan will find all N index tuples
matching the key ... only one of which is the one you are looking for on
this iteration of the outer loop.

regards, tom lane




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Zeugswetter Andreas SB


 A particular point worth making is that in the common case where you've
 updated the same row N times (without changing its index key), the above
 approach has O(N^2) runtime.  The indexscan will find all N index tuples
 matching the key ... only one of which is the one you are looking for on
 this iteration of the outer loop.

It was my understanding that the heap xtid is part of the key now, and thus
with a somewhat modified access it would find the one exact row directly.
And in the above case, the keys (since identical except for xtid) will stick close
together, thus caching will be good.

Andreas




Re: AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Zeugswetter Andreas SB  [EMAIL PROTECTED] writes:
 It was my understanding that the heap xtid is part of the key now,

It is not.

There was some discussion of doing that, but it fell down on the little
problem that in normal index-search cases you *don't* know the heap tid
you are looking for.

 And in the above case, the keys (since identical except for xtid) will stick close
 together, thus caching will be good.

Even without key-collision problems, deleting N tuples out of a total of
M index entries will require search costs like this:

bulk delete in linear scan way:

    O(M)        I/O costs (read all the pages)
    O(M log N)  CPU costs (look up each TID in sorted list)

successive index probe way:

    O(N log M)  I/O costs for probing index
    O(N log M)  CPU costs for probing index (key comparisons)

For N << M, the latter looks like a win, but you have to keep in mind
that the constant factors hidden by the O() notation are a lot different
in the two cases.  In particular, if there are T index entries per page,
the former I/O cost is really M/T * sequential read cost whereas the
latter is N log M * random read cost, yielding a difference in constant
factors of probably a thousand or two.  You get some benefit in the
latter case from caching the upper btree levels, but that's by
definition not a large part of the index bulk.  So where's the breakeven
point in reality?  I don't know but I suspect that it's at pretty small
N.  Certainly far less than one percent of the table, whereas I would
think that people would try to schedule VACUUMs at an interval where
they'd be reclaiming several percent of the table.
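
Plugging illustrative numbers into those formulas makes the point concrete
(a standalone back-of-the-envelope program; every constant is an assumption
for the example, not a measurement - build with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double M = 1e7;         /* index entries                        */
        double T = 200;         /* index entries per 8k page            */
        double seq_io = 1e-4;   /* assumed seconds/page, sequential     */
        double rand_io = 1e-2;  /* assumed seconds/page, random         */

        for (double N = 10; N <= 1e5; N *= 10)
        {
            double bulk  = (M / T) * seq_io;      /* scan whole index   */
            double probe = N * log2(M) * rand_io; /* probe per victim   */
            printf("N=%8.0f  bulk=%6.1fs  probe=%9.1fs\n", N, bulk, probe);
        }
        return 0;
    }

With these made-up constants the whole-index scan costs about 5 seconds
regardless of N, while per-tuple probing overtakes it already at a few
dozen tuples - a vanishingly small fraction of the ten million entries,
consistent with the guess that breakeven lies far below one percent.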

So, as I said to Hiroshi, this alternative looks to me like a possible
future refinement, not something we need to do in the first version.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Oleg Bartunov [EMAIL PROTECTED] writes:
 On Thu, 17 May 2001, Tom Lane wrote:
 We will also want to look at upgrading the non-btree index types to allow
 concurrent operations.

 Am I right that you plan to work with GiST indexes as well?
 We read a paper Concurrency and Recovery in Generalized Search Trees
 by Marcel Kornacker, C. Mohan, Joseph Hellerstein
 (http://citeseer.nj.nec.com/kornacker97concurrency.html)
 and probably we could go in this direction. Right now we're working
 on adding of multi-key support to GiST.

Yes, GIST should be upgraded to do concurrency.  But I have no objection
if you want to work on multi-key support first.

My feeling is that a few releases from now we will have btree and GIST
as the preferred/well-supported index types.  Hash and rtree might go
away altogether --- AFAICS they don't do anything that's not done as
well or better by btree or GIST, so what's the point of maintaining
them?

 btw, I have a question about function gistPageAddItem in gist.c:
 it just decompresses and recompresses the key and calls PageAddItem to
 write the tuple. We don't understand why we need this function -

The comment says

** Take a compressed entry, and install it on a page.  Since we now know
** where the entry will live, we decompress it and recompress it using
** that knowledge (some compression routines may want to fish around
** on the page, for example, or do something special for leaf nodes.)

Are you prepared to say that you will no longer support the ability for
GIST compression routines to do those things?  That seems shortsighted.

regards, tom lane




RE: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Mikheev, Vadim

 Vadim, can you remind me what UNDO is used for?

Ok, last reminder -:))

On transaction abort, read WAL records and undo (rollback)
changes made in storage. Would allow:

1. Reclaim space allocated by aborted transactions.
2. Implement SAVEPOINTs.
   Just to remind -:) - in the event of an error discovered by the server
   - duplicate key, deadlock, command mistyping, etc. - the transaction
   will be rolled back to the nearest implicit savepoint set
   just before query execution; or the transaction can be aborted by a
   ROLLBACK TO savepoint_name command to some explicit savepoint
   set by the user. A transaction rolled back to a savepoint may be continued.
3. Reuse transaction IDs on postmaster restart.
4. Split pg_log into small files with ability to remove old ones (which
   do not hold statuses for any running transactions).
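
In outline, rollback via that back-chain could look like this (a schematic
sketch; the record layout and helper functions are invented):

    typedef struct WalRecord
    {
        long prev_lsn;   /* back-pointer: same xact's previous record  */
        int  op;         /* what was done: insert/update/delete/...    */
        /* ... undo payload: physical address of the affected tuple    */
    } WalRecord;

    extern WalRecord *read_wal_record(long lsn);
    extern void       apply_undo(const WalRecord *rec);

    /* Undo everything one transaction did, newest change first.  Only
       the transaction's own records are visited; the rest of the log
       is never scanned. */
    void
    undo_transaction(long last_lsn)
    {
        for (long lsn = last_lsn; lsn != 0; )
        {
            WalRecord *rec = read_wal_record(lsn);
            apply_undo(rec);
            lsn = rec->prev_lsn;
        }
    }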

Vadim




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 I am confused why we can't implement subtransactions as part of our
 command counter?  The counter is already 4 bytes long.  Couldn't we
 roll back to counter number X-10?

That'd work within your own transaction, but not from outside it.
After you commit, how will other backends know which command-counter
values of your transaction to believe, and which not?

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Hey, I have an idea.  Can we do subtransactions as separate transactions
 (as Tom mentioned), and put the subtransaction IDs in the WAL, so they
 can be safely committed/rolled back as a group?

It's not quite that easy: all the subtransactions have to commit at
*the same time* from the point of view of other xacts, or you have
consistency problems.  So there'd need to be more xact-commit mechanism
than there is now.  Snapshots are also interesting; we couldn't use a
single xact ID per backend to show the open-transaction state.

WAL doesn't really enter into it AFAICS...

regards, tom lane




[HACKERS] Plans for solving the VACUUM problem

2001-05-17 Thread Tom Lane

I have been thinking about the problem of VACUUM and how we might fix it
for 7.2.  Vadim has suggested that we should attack this by implementing
an overwriting storage manager and transaction UNDO, but I'm not totally
comfortable with that approach: it seems to me that it's an awfully large
change in the way Postgres works.  Instead, here is a sketch of an attack
that I think fits better into the existing system structure.

First point: I don't think we need to get rid of VACUUM, exactly.  What
we want for 24x7 operation is to be able to do whatever housekeeping we
need without locking out normal transaction processing for long intervals.
We could live with routine VACUUMs if they could run in parallel with
reads and writes of the table being vacuumed.  They don't even have to run
in parallel with schema updates of the target table (CREATE/DROP INDEX,
ALTER TABLE, etc).  Schema updates aren't things you do lightly for big
tables anyhow.  So what we want is more of a "background VACUUM" than a
"no VACUUM" solution.

Second: if VACUUM can run in the background, then there's no reason not
to run it fairly frequently.  In fact, it could become an automatically
scheduled activity like CHECKPOINT is now, or perhaps even a continuously
running daemon (which was the original conception of it at Berkeley, BTW).
This is important because it means that VACUUM doesn't have to be perfect.
The existing VACUUM code goes to huge lengths to ensure that it compacts
the table as much as possible.  We don't need that; if we miss some free
space this time around, but we can expect to get it the next time (or
eventually), we can be happy.  This leads to thinking of space management
in terms of steady-state behavior, rather than the periodic "big bang"
approach that VACUUM represents now.

But having said that, there's no reason to remove the existing VACUUM
code: we can keep it around for situations where you need to crunch a
table as much as possible and you can afford to lock the table while
you do it.  The new code would be a new command, maybe VACUUM LAZY
(or some other name entirely).

Enough handwaving, what about specifics?

1. Forget moving tuples from one page to another.  Doing that in a
transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.

2. This does no good unless there's a provision to re-use that free space.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space.  Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.

3. Lazy VACUUM processes a table in five stages:
   A. Scan relation looking for dead tuples; accumulate a list of their
  TIDs, as well as info about existing free space.  (This pass is
  completely read-only and so incurs no WAL traffic.)
   B. Remove index entries for the dead tuples.  (See below for details.)
   C. Physically delete dead tuples and compact free space on their pages.
   D. Truncate any completely-empty pages at relation's end.  (Optional,
  see below.)
   E. Create/update FSM entry for the table.
Note that this is crash-safe as long as the individual update operations
are atomic (which can be guaranteed by WAL entries for them).  If a tuple
is dead, we care not whether its index entries are still around or not;
so there's no risk to logical consistency.

4. Observe that lazy VACUUM need not really be a transaction at all, since
there's nothing it does that needs to be cancelled or undone if it is
aborted.  This means that its WAL entries do not have to hang around past
the next checkpoint, which solves the huge-WAL-space-usage problem that
people have noticed while VACUUMing large tables under 7.1.

5. Also note that there's nothing saying that lazy VACUUM must do the
entire table in one go; once it's accumulated a big enough batch of dead
tuples, it can proceed through steps B,C,D,E even though it's not scanned
the whole table.  This avoids a rather nasty problem that VACUUM has
always had with running out of memory on huge tables.
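
Sketched as a loop, the batched five-stage scheme of points 3 and 5 might
read like this (all names and types are illustrative, not proposed code):

    /* One lazy VACUUM pass: collect a batch of dead TIDs, run stages
       B and C for that batch, and repeat until the table is done. */
    void
    lazy_vacuum_rel(Rel *rel)
    {
        TidBatch dead;

        while (scan_for_dead_tuples(rel, &dead))            /* stage A */
        {
            for (int i = 0; i < rel->nindexes; i++)
                bulk_delete_index(rel->indexes[i], &dead);  /* stage B */
            delete_tuples_and_compact(rel, &dead);          /* stage C */
            reset_batch(&dead);
        }
        truncate_empty_tail(rel);                           /* stage D */
        update_fsm(rel);                                    /* stage E */
    }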


Free space map details
--

I envision the FSM as a shared hash table keyed by table ID, with each
entry containing a list of page numbers and free space in each such page.

The FSM is empty at system startup and is filled by lazy VACUUM as it
processes each table.  Backends then decrement/remove page entries as they
use free space.
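
That description suggests data structures along these lines (sizes and
field names are guesses for illustration only):

    /* Toy shape of the shared-memory FSM: one entry per relation, each
       holding a capped list of (page, free bytes) pairs. */
    typedef struct FsmPageSlot
    {
        unsigned blkno;        /* page number within the relation      */
        unsigned freebytes;    /* advisory count; may be out of date   */
    } FsmPageSlot;

    typedef struct FsmRelEntry
    {
        unsigned    relid;     /* hash key: the table's OID            */
        int         npages;    /* slots in use                         */
        FsmPageSlot pages[64]; /* capped: pages with the least free
                                  space are forgotten under pressure   */
    } FsmRelEntry;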

Critical point: the FSM is only a hint and does not have to be perfectly
accurate.  It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle.  This allows
us to take shortcuts in maintaining it.  In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared

Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-17 Thread Tom Lane

Matthew T. O'Connor [EMAIL PROTECTED] writes:
 Another quick thought for handling FSM contention problems.  A backend could
 give up waiting for access to the FSM after a short period of time, and just
 append its data to the end of the file the same way it's done now.  Dunno
 if that is feasible but it seemed like an idea to me.

Mmm ... maybe, but I doubt it'd help much.  Appending a page to the file
requires grabbing some kind of lock anyway (since you can't have two
backends doing it at the same instant).  With any luck, that locking can
be merged with the locking involved in accessing the FSM.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-17 Thread Philip Warner

At 19:05 17/05/01 -0400, Tom Lane wrote:

But having said that, there's no reason to remove the existing VACUUM
code: we can keep it around for situations where you need to crunch a
table as much as possible and you can afford to lock the table while
you do it.  

It would be great if this was the *only* reason to use the old-style
VACUUM. ie. We should try to avoid a solution that has a VACUUM LAZY in the
background and a recommendation to run a 'VACUUM PROPERLY' once in a while.


The new code would be a new command, maybe VACUUM LAZY
(or some other name entirely).

Maybe a name that reflects its strength/purpose: 'VACUUM
ONLINE/BACKGROUND/NOLOCKS/CONCURRENT' etc.


Enough handwaving, what about specifics?

1. Forget moving tuples from one page to another.  Doing that in a
transaction-safe way is hugely expensive and complicated.  Lazy VACUUM
will only delete dead tuples and coalesce the free space thus made
available within each page of a relation.

Could this be done opportunistically, meaning it builds up a list of
candidates to move (perhaps based on emptiness of page), then moves a
subset of these in each pass? It's only really useful in the case of a
table that has a high update load and then becomes static, which is not as
unusual as it sounds: people do archive tables by renaming them, then
create a new lean 'current' table. With the new vacuum, the static table
may end up with many half-empty pages that are never reused.


2. This does no good unless there's a provision to re-use that free space.
To do that, I propose a free space map (FSM) kept in shared memory, which
will tell backends which pages of a relation have free space.  Only if the
FSM shows no free space available will the relation be extended to insert
a new or updated tuple.

I assume that now is not a good time to bring up memory-mapped files? ;-}


3. Lazy VACUUM processes a table in five stages:
   A. Scan relation looking for dead tuples; accumulate a list of their
  TIDs, as well as info about existing free space.  (This pass is
  completely read-only and so incurs no WAL traffic.)

Were you planning on just a free byte count, or something smaller? Dec/RDB
uses a nasty system of DBA-defined thresholds for each storage area: 4 bits,
where 0=empty, and 1, 2 & 3 indicate above/below thresholds (3 is also
considered 'full'). The thresholds are usually set based on average record
sizes. In this day & age, I suspect a 1 byte percentage, or 2 byte count is
OK unless space is really at a premium.


5. Also note that there's nothing saying that lazy VACUUM must do the
entire table in one go; once it's accumulated a big enough batch of dead
tuples, it can proceed through steps B,C,D,E even though it's not scanned
the whole table.  This avoids a rather nasty problem that VACUUM has
always had with running out of memory on huge tables.

This sounds great, especially if the same approach could be adopted when/if
moving records.


Critical point: the FSM is only a hint and does not have to be perfectly
accurate.  It can omit space that's actually available without harm, and
if it claims there's more space available on a page than there actually
is, we haven't lost much except a wasted ReadBuffer cycle.

So long as you store the # of bytes (or %), that should be fine. One of the
horrors of the Dec/RDB system is that with badly set thresholds you can
cycle through many pages looking for one that *really* has enough free space.

Also, would the detecting process fix the bad entry?


  This allows
us to take shortcuts in maintaining it.  In particular, we can constrain
the FSM to a prespecified size, which is critical for keeping it in shared
memory.  We just discard entries (pages or whole relations) as necessary
to keep it under budget.

Presumably keeping the 'most empty' pages?


Obviously, we'd not bother to make entries in
the first place for pages with only a little free space.  Relation entries
might be discarded on a least-recently-used basis.

You also might want to record some 'average/min/max' record size for the
table to assess when a page's free space is insufficient for the
average/minimum record size.

While on the subject of record keeping, it would be great if it was coded
to collect statistics about its own operation for Jan's stats package.


Accesses to the FSM could create contention problems if we're not careful.
I think this can be dealt with by having each backend remember (in its
relcache entry for a table) the page number of the last page it chose from
the FSM to insert into.  That backend will keep inserting new tuples into
that same page, without touching the FSM, as long as there's room there.
Only then does it go back to the FSM, update or remove that page entry,
and choose another page to start inserting on.  This reduces the access
load on the FSM from once per tuple to once per page.  

This seems to have the potential to create as many false FSM page entries
as there are backends. Is it really that expensive to lock the FSM table
entry, subtract a number, then unlock it?
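
A sketch of that cached-target-page scheme (illustrative only; the helpers
and the static variable standing in for the relcache field are invented):

    extern int  fsm_choose_page(int tuplen);    /* takes the FSM lock  */
    extern void fsm_report_space(int blkno);    /* takes the FSM lock  */
    extern int  page_has_room(int blkno, int tuplen);

    static int cached_target = -1;  /* would live in the relcache entry */

    int
    get_insert_page(int tuplen)
    {
        /* Fast path: keep filling the page we already chose. */
        if (cached_target >= 0 && page_has_room(cached_target, tuplen))
            return cached_target;
        /* Slow path: one locked FSM access per page, not per tuple. */
        if (cached_target >= 0)
            fsm_report_space(cached_target);    /* fix our old entry   */
        cached_target = fsm_choose_page(tuplen);
        return cached_target;
    }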

Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-17 Thread Tom Lane

Mike Mascari [EMAIL PROTECTED] writes:
 Very neat. You mention that the truncation of both heap and index 
 relations is not necessarily mandatory. Under what conditions would 
 either of them be truncated?

In the proposal as given, a heap file would be truncated if (a) it
has at least one totally empty block at the end, and (b) no other
transaction is touching the table at the instant that VACUUM is
ready to truncate it.

This would probably be fairly infrequently true, especially for
heavily used tables, but if you believe in a steady state analysis
then that's just fine.  No point in handing blocks back to the OS
only to have to allocate them again soon.

We might want to try to tilt the FSM-driven reuse of freed space
to favor space near the start of the file and avoid end blocks.
Without that, you might never get totally free blocks at the end.

The same comments hold for index blocks, with the additional problem
that the index structure would make it almost impossible to drive usage
away from the physical end-of-file.  For btrees I think it'd be
sufficient if we could recycle empty blocks for use elsewhere in the
btree structure.  Actually shrinking the index probably won't happen
short of a REINDEX.

regards, tom lane




Re: [HACKERS] Plans for solving the VACUUM problem

2001-05-17 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 The only question I have is about the Free Space Map.  It would seem
 better to me if we could get this map closer to the table itself, rather
 than having every table of every database mixed into the same shared
 memory area.  I can just see random table access clearing out most of
 the map cache and perhaps making it less useful.

What random access?  Read transactions will never touch the FSM at all.
As for writes, seems to me the places you are writing are exactly the
places you need info for.

You make a good point, which is that we don't want a schedule-driven
VACUUM to load FSM entries for unused tables into the map at the cost
of throwing out entries that *are* being used.  But it seems to me that
that's easily dealt with if we recognize the risk.

 It would be nice if we could store the map on the first page of the disk
 table, or store it in a flat file per table.  I know both of these ideas
 will not work,

You said it.  What's wrong with shared memory?  You can't get any closer
than shared memory: keeping maps in the files would mean you'd need to
chew up shared-buffer space to get at them.  (And what was that about
random accesses causing your maps to get dropped?  That would happen
for sure if they live in shared buffers.)

Another problem with keeping stuff in the first page: what happens when
the table gets big enough that 8k of map data isn't really enough?
With a shared-memory area, we can fairly easily allocate a variable
amount of space based on total size of a relation vs. total size of
relations under management.

It is true that a shared-memory map would be useless at system startup,
until VACUUM has run and filled in some info.  But I don't see that as
a big drawback.  People who aren't developers like us don't restart
their postmasters every five minutes.

 Another advantage of centralization is that we can record update/delete
 counters per table, helping tell vacuum where to vacuum next.  Vacuum
 roaming around looking for old tuples seems wasteful.

Indeed.  But I thought you were arguing against centralization?

regards, tom lane
