AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-30 Thread Zeugswetter Andreas SB


So are whole pages stored in rollback segments or just
the modified data?
   
   This is implementation dependent. Storing whole pages is
   much easier to do, but obviously it's better to store just
   modified data.
  
  I am not sure it is necessarily better. Seems to be a tradeoff here.
  pros of whole pages:
  a possible merge with physical log (for first modification of a page
  after checkpoint there would be no overhead compared to current,
  since it is already written now)
 
 Using WAL as RS data storage is questionable.

No, I meant the other way around. Move the physical log pages away from WAL 
files to the rollback segment (imho snapshot area would be a better name)

  in a clever implementation a page already in the rollback segment might
  satisfy the modification of another row on that page, and thus would not
  need any additional io.
 
 This would be possible only if there was no commit (same SCN)
 between two modifications.

I don't think someone else's commit matters unless it touches the same page.
In that case a reader would possibly need to chain back to an older version 
inside the snapshot area, and then it gets complicated even in the whole page 
case. A good concept could probably involve both whole-page and change-only
storage, and let the optimizer decide what to do.

 But, aren't we too deep on overwriting smgr (O-smgr) implementation?

Yes, but some understanding of the possibilities needs to be sorted out 
to allow good decisions, no?

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Zeugswetter Andreas SB


 You mean it is restored in session that is running the transaction ?
   
   Depends on what you mean with restored. It first reads the heap page,
   sees that it needs an older version and thus reads it from the rollback segment.
  
  So are whole pages stored in rollback segments or just the modified data?
 
 This is implementation dependent. Storing whole pages is much easier to do,
 but obviously it's better to store just modified data.

I am not sure it is necessarily better. Seems to be a tradeoff here.
pros of whole pages:
a possible merge with physical log (for first modification of a page after
checkpoint there would be no overhead compared to current since it is already
written now)
in a clever implementation a page already in the rollback segment might
satisfy the modification of another row on that page, and thus would not need
any additional io.

Andreas




RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-29 Thread Mikheev, Vadim

   So are whole pages stored in rollback segments or just
   the modified data?
  
  This is implementation dependent. Storing whole pages is
  much easier to do, but obviously it's better to store just
  modified data.
 
 I am not sure it is necessarily better. Seems to be a tradeoff here.
 pros of whole pages:
   a possible merge with physical log (for first
   modification of a page after checkpoint
   there would be no overhead compared to current 
   since it is already written now)

Using WAL as RS data storage is questionable.

   in a clever implementation a page already in the
   rollback segment might satisfy the 
   modification of another row on that page, and 
   thus would not need any additional io.

This would be possible only if there was no commit (same SCN)
between two modifications.

But, aren't we too deep on overwriting smgr (O-smgr) implementation?
It's doable. It has advantages in terms of the IO active transactions
must do to follow MVCC. It has drawbacks in terms of required
disk space (and, oh yeah, it's not easy to implement -:)).
So, any other opinions about value of O-smgr?

Vadim




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-28 Thread Zeugswetter Andreas SB


  You mean it is restored in session that is running the transaction ?

Depends on what you mean with restored. It first reads the heap page,
sees that it needs an older version and thus reads it from the rollback segment.

  
  I guess that it could be slower than our current way of doing it.
 
 Yes, for older transactions which *really* need *particular*
 old data, but not for newer ones. Look - now transactions have to read
 dead data again and again, even if some of them (newer) need not see
 those data at all, and we keep dead data as long as required for other
 old transactions *just in case* they will look there.
 But who knows?! Maybe those old transactions will not read from the table
 with a big amount of dead data at all! So - why keep dead data in datafiles
 for a long time? This obviously affects overall system performance.

Yes, that is a good description. An old version is only required in the following
two cases (a rough sketch of this check follows below):

1. the txn that modified this tuple is still open (reader in default read committed)
2. the reader is in serializable transaction isolation and has an earlier xtid
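
A rough sketch of that check, reading "has an earlier xtid" as "started with an
earlier transaction id". Every name below is invented for illustration; this is
not PostgreSQL source, just the shape of the test:

/*
 * Sketch only: when would a reader have to chase an old tuple version
 * under an overwriting smgr?
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Xid;

typedef enum { READ_COMMITTED, SERIALIZABLE } IsolationLevel;

typedef struct
{
    Xid  xmax;            /* xid of the transaction that overwrote the tuple */
    bool xmax_committed;  /* has that transaction committed ... */
    bool xmax_running;    /* ... or is it still open? */
} TupleVersionInfo;

/*
 * Returns true if the reader cannot use the current (overwritten) version
 * and must fetch the prior image, e.g. from a rollback/snapshot segment.
 */
static bool
need_old_version(Xid reader_xid, IsolationLevel level, TupleVersionInfo tup)
{
    /* Case 1: the modifying txn is still open, so even a read-committed
     * reader must not see its uncommitted overwrite. */
    if (tup.xmax_running && tup.xmax != reader_xid)
        return true;

    /* Case 2: a serializable reader that started before the overwriting
     * txn must still see the older version. */
    if (level == SERIALIZABLE && tup.xmax_committed && reader_xid < tup.xmax)
        return true;

    return false;
}

int
main(void)
{
    TupleVersionInfo open_txn  = { 200, false, true };
    TupleVersionInfo committed = { 200, true, false };

    printf("%d\n", need_old_version(150, READ_COMMITTED, open_txn));    /* 1 */
    printf("%d\n", need_old_version(150, SERIALIZABLE,   committed));   /* 1 */
    printf("%d\n", need_old_version(300, READ_COMMITTED, committed));   /* 0 */
    return 0;
}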

It seems the overwriting smgr mainly has advantages in terms of speed for
operations other than rollback.

Andreas




Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-28 Thread Vadim Mikheev

You mean it is restored in session that is running the transaction ?
  
  Depends on what you mean with restored. It first reads the heap page,
  sees that it needs an older version and thus reads it from the rollback segment.
 
 So are whole pages stored in rollback segments or just the modified data?

This is implementation dependent. Storing whole pages is much easier to do,
but obviously it's better to store just modified data.

Vadim






AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-25 Thread Zeugswetter Andreas SB


   Impractical ? Oracle does it.
  
  Oracle has MVCC?
  
  With restrictions, yes.
 
 What restrictions? Rollback segments size?

No, that is not the whole story. The problem with their rollback segment approach
is that they do not guard against overwriting a tuple version in the rollback
segment. They simply recycle each segment in a wrap-around manner.
Thus there could be an open transaction that still wanted to see a tuple version
that was already overwritten, leading to the feared "snapshot too old" error.

Copying their rollback segment approach is imho the last thing we want to do.

 Non-overwriting smgr can eat all disk space...
 
  You didn't know that?  Vadim did ...
 
 Didn't I mention a few times that I was inspired by Oracle? -:)

Looking at what they supply in the feature area is imho good.
Copying their technical architecture is not so good in general.

Andreas




RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-24 Thread Mikheev, Vadim

  - A simple typo in psql can currently cause a forced 
  rollback of the entire TX. UNDO should avoid this.
 
 Yes, I forgot to mention this very big advantage, but undo is
 not the only possible way to implement savepoints. Solutions
 using CommandCounter have been discussed.

This would be hell.

Vadim




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


  The downside would only be, that long running txn's cannot
  [easily] rollback to savepoint.
 
 We should implement savepoints for all or none transactions, no?

We should not limit transaction size to the disk space available online for WAL.
Imho that is much more important. With guaranteed undo we would need
disk space for more than 2x the new data size (+ at least space for 1x all modified
pages unless the physical log is separated from WAL).

Imho a good design should involve only a little more than 1x the new data size.

 
   2. Abort long running transactions.
  
  This is imho the big downside of UNDO, and should not
  simply be put on the TODO without thorough research. I think it
  would be better to forget UNDO for long running transactions
  before aborting them.
 
 Abort could be configurable.

The point is that you need to abort before WAL runs out of disk space,
regardless of configuration.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


 If community will not like UNDO then I'll probably try to implement

Imho UNDO would be great under the following circumstances:
1. The undo is only registered for some background work process
and not done in the client's backend (or only if it is a small txn).
2. The same mechanism should also be used for outdated tuples
(the only difference being that some tuples need to wait longer
 because of an active xid)

The reasoning for not doing it in the client's backend is not only that the client
does not need to wait, but that the nervous dba tends to kill the backend if, after
one hour of forward work, it seemingly does not respond anymore (because it is
busy with undo).

 dead space collector which will read log files and so on.

Which would then only be a possible implementation variant of the above :-)
The first step would probably be to separate the physical log to reduce WAL size.

 to implement logging for non-btree indices (anyway required for UNDO,
 WAL-based BAR, WAL-based space reusing).

Imho it would be great to implement a generic (albeit more expensive)
redo for all possible index types, which would be used in the absence of a physical
redo for that particular index type (which is currently available for btree).

The prerequisite would be a physical log that saves the page before
modification. The redo could then be done (with the info from the heap tuple
log record) with the same index interface that is used during normal operation.

Imho implementing a new index type is difficult enough as is without the need 
to write a redo and undo.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB

  People also have referred to an overwriting smgr
  easily. Please tell me how to introduce an overwriting smgr
  without UNDO.

There is no way, although undo for an overwriting smgr would involve a
very different approach than with a non-overwriting one. See Vadim's post about
what info suffices for undo in a non-overwriting smgr (file and ctid).

 I guess that is the question.  Are we heading for an overwriting storage
 manager?  I didn't see that in Vadim's list of UNDO advantages, but
 maybe that is his final goal.  If so UNDO may make sense, but then the
 question is how do we keep MVCC with an overwriting storage manager?
 
 The only way I can see doing it is to throw the old tuples into the WAL
 and have backends read through that for MVCC info.

If PostgreSQL wants to stay MVCC, then we should imho forget overwriting smgr
very fast.

Let me try to list the pros and cons that I can think of:
Pro:
no index modification if the key stays the same
no search for free space on update (if the tuple still fits into the page)
no pg_log
Con:
additional IO to write the before-image to the rollback segment
(every before-image, not only the first after a checkpoint)
(also the before-image of every index page that is updated!)
need for a rollback segment, which imposes all sorts of contention problems
an active rollback needs to do a lot of undo work

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


 If community will not like UNDO then I'll probably try to implement
 dead space collector which will read log files and so on. 
 
 I'd vote for UNDO; in terms of usability & friendliness it's a big win.

Could you please be a little more verbose? I am very interested in
the advantages you see in UNDO for rollback only.

pg_log is a very big argument, but couldn't we try to change the format
to something that only stores ranges of aborted txn's in a btree-like format?
Now that we have WAL, that should be possible.
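
A minimal sketch of what such a lookup could look like: committed is the
default, only ranges of aborted xids are recorded, and a status check is a
binary search over those ranges. All names are invented, and in-progress
transactions would of course still need separate handling:

/* Sketch only, not an actual PostgreSQL data structure. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Xid;

typedef struct
{
    Xid first;
    Xid last;               /* inclusive */
} AbortedRange;

/* Only consult this for xids older than the oldest running transaction;
 * anything not recorded here is assumed committed. */
static bool
xid_is_aborted(const AbortedRange *ranges, int nranges, Xid xid)
{
    int lo = 0, hi = nranges - 1;

    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;

        if (xid < ranges[mid].first)
            hi = mid - 1;
        else if (xid > ranges[mid].last)
            lo = mid + 1;
        else
            return true;    /* xid falls inside an aborted range */
    }
    return false;
}

int
main(void)
{
    AbortedRange ranges[] = { {100, 100}, {205, 210}, {512, 640} };

    printf("xid 207 aborted? %d\n", xid_is_aborted(ranges, 3, 207));  /* 1 */
    printf("xid 300 aborted? %d\n", xid_is_aborted(ranges, 3, 300));  /* 0 */
    return 0;
}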

Andreas




Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Philip Warner

At 11:25 23/05/01 +0200, Zeugswetter Andreas SB wrote:

 If community will not like UNDO then I'll probably try to implement
 dead space collector which will read log files and so on. 
 
 I'd vote for UNDO; in terms of usability & friendliness it's a big win.

Could you please be a little more verbose? I am very interested in
the advantages you see in UNDO for rollback only.

I have not been paying strict attention to this thread, so it may have
wandered into a narrower band than I think we are in, but my understanding
is that UNDO is required for partial rollback in the case of failed
commands within a single TX. Specifically,

- A simple typo in psql can currently cause a forced rollback of the entire
TX. UNDO should avoid this.

- It is not uncommon for applications in other databases to handle errors
from the database (eg. missing FKs), and continue a TX.

- Similarly, when we get a new error reporting system, general constraint
(or other) failures should be able to be handled in one TX.



Philip Warner| __---_
Albatross Consulting Pty. Ltd.   |/   -  \
(A.B.N. 75 008 659 498)  |  /(@)   __---_
Tel: (+61) 0500 83 82 81 | _  \
Fax: (+61) 0500 83 82 82 | ___ |
Http://www.rhyme.com.au  |/   \|
 |----
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-23 Thread Zeugswetter Andreas SB


 - A simple typo in psql can currently cause a forced rollback of the entire
 TX. UNDO should avoid this.

Yes, I forgot to mention this very big advantage, but undo is not the only
possible way to implement savepoints. Solutions using CommandCounter have been
discussed. Although the pg_log mechanism would become more complex, a background
vacuum-like process could put the highest priority on removing such rolled-back
parts of transactions.

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB

 Todo:
 
 1. Compact log files after checkpoint (save records of uncommitted
transactions and remove/archive others).

On the grounds that undo is not guaranteed anyway (concurrent heap access),
why not simply forget it, since the above sounds rather expensive?
The downside would only be that long running txn's cannot [easily] roll back
to a savepoint.

 2. Abort long running transactions.

This is imho the big downside of UNDO, and should not simply be put on
the TODO without thorough research. I think it would be better to forget UNDO for
long running transactions before aborting them.

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB


 As  a  rule  of  thumb,  online  applications  that hold open
 transactions during user interaction  are  considered  to  be
 Broken  By  Design  (tm).   So I'd slap the programmer/design
 team with - let's use the server box since it doesn't contain
 anything useful.

We have a database system here, and not an OLTP helper app.
A database system must support all sorts of mixed usage, from simple
OLTP to OLAP. Imho the usual separation onto different servers gives more
headaches than necessary.

Thus the above statement can imho be true for one OLTP application, but not
for all applications on one db server.

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB


 Correct me if I am wrong, but both cases do present a problem currently 
 in 7.1.  The WAL log will not remove any WAL files for transactions that 
 are still open (even after a checkpoint occurs).  Thus if you do a bulk 
 insert of gigabyte size you will require a gigabyte sized WAL 
 directory.  Also if you have a simple OLTP transaction that the user 
 started and walked away from for his one week vacation, then no WAL log 
 files can be deleted until that user returns from his vacation and ends 
 his transaction.

I am not sure; it might be implemented that way. But there is no technical reason
to keep them beyond a checkpoint without UNDO.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-22 Thread Zeugswetter Andreas SB

 
 REDO in oracle is done by something known as a 'rollback segment'.  

You are not seriously saying that you like the rollback segments in Oracle.
They only cause trouble: 
1. configuration (for every different workload you need a different config) 
2. snapshot too old 
3. tx abort because rollback segments are full
4. They use up huge amounts of space (e.g. 20 Gb rollback seg for a 120 Gb SAP)

If I read the papers correctly, Version 9 gets rid of Point 1, but the rest ...

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


  Vadim, can you remind me what UNDO is used for?
 4. Split pg_log into small files with ability to remove old ones (which
do not hold statuses for any running transactions).

They are already small (16Mb). Or do you mean even smaller?
This imposes one huge risk that is already a pain in other db's. You need
all logs of one transaction online. For a gigabyte transaction like a bulk
insert this can be very inconvenient.
Imho there should be some limit where you can choose whether you want
to continue without the feature (no savepoint) or be automatically aborted.

In any case, imho some thought should be put into this :-)

Another case where this is a problem is a client that starts a tx, does one little
insert or update on his private table, and then sits and waits for a day.

Both cases currently pose no problem whatsoever.

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


   Vadim, can you remind me what UNDO is used for?
  4. Split pg_log into small files with ability to remove old ones (which
 do not hold statuses for any running transactions).

and I wrote:
 They are already small (16Mb). Or do you mean even smaller ?

Sorry for the above little confusion of pg_log with WAL on my side :-(

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


 Really?! Once again: WAL records give you *physical* address of tuples
 (both heap and index ones!) to be removed and size of log to read
 records from is not comparable with size of data files.

So how about a background vacuum-like process that reads the WAL
and does the cleanup? Seems that would be great, since it then does not
need to scan, and does not make foreground cleanup necessary.

The problem is when cleanup cannot keep up with WAL files that already
want to be removed. I would envision a config that says how many MB of WAL
are allowed to queue up before clients are blocked.
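
As a toy illustration of the shape of such a collector (the segment size, the
config knob and the back-pressure policy are all made up, nothing here is real
code):

#include <stdio.h>

#define SEGMENT_MB      16     /* size of one WAL segment */
#define MAX_BACKLOG_MB  64     /* imagined "how much WAL may queue up" knob */

int
main(void)
{
    int pending_segments = 6;  /* completed WAL segments waiting for cleanup */

    while (pending_segments > 0)
    {
        int backlog_mb = pending_segments * SEGMENT_MB;

        /* If cleanup cannot keep up, block (or throttle) new writers
         * instead of letting the WAL directory grow without bound. */
        if (backlog_mb > MAX_BACKLOG_MB)
            printf("backlog %d MB > %d MB: blocking new writes\n",
                   backlog_mb, MAX_BACKLOG_MB);

        /* WAL records carry the physical heap/index addresses of dead
         * tuples, so reclaiming them needs no scan of the data files. */
        printf("processing one segment, reclaiming its dead tuples\n");
        pending_segments--;
    }

    printf("WAL backlog drained\n");
    return 0;
}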

Andreas




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


 Would it be possible to split the WAL traffic into two sets of files,

Sure, the downside is two fsyncs :-( When I first suggested the physical log
I had a separate file in mind, but that is imho only a small issue.

Of course people with more than 3 disks could benefit from a split.

Tom: If your ratio of physical pages vs WAL records is so bad, the config
should simply be changed to do fewer checkpoints (say every 20 min, like a
typical Informix setup).

Andreas




Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Tom Lane

Zeugswetter Andreas SB  [EMAIL PROTECTED] writes:
 Tom: If your ratio of physical pages vs WAL records is so bad, the config
 should simply be changed to do fewer checkpoints (say every 20 min, like a
 typical Informix setup).

I was using the default configuration.  What caused the problem was
probably not so much the standard 5-minute time-interval-driven
checkpoints, as it was the standard every-3-WAL-segments checkpoints.
Possibly we ought to increase that number?

regards, tom lane




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


 My point is that we'll need dynamic cleanup anyway and UNDO is
 what should be implemented for dynamic cleanup of aborted changes.

I do not yet understand why you want to handle aborts differently from outdated
tuples. The ratio in a well-tuned system should well favor outdated tuples.
If someone ever adds dirty read, it is also not guaranteed
that nobody accesses the tuple you currently want to undo. So I really fail to see
the big difference.

Andreas




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Zeugswetter Andreas SB


  Tom: If your ratio of physical pages vs WAL records is so bad, the config
  should simply be changed to do fewer checkpoints (say every 20 min, like a
  typical Informix setup).
 
 I was using the default configuration.  What caused the problem was
 probably not so much the standard 5-minute time-interval-driven

I am quite sure that I would increase the default to at least 15 min here.

 checkpoints, as it was the standard every-3-WAL-segments checkpoints.
 Possibly we ought to increase that number?

Here I am unfortunately not so sure, given the current logic (that you can only free
them after the checkpoint). I think the admin has to choose this. Maybe increase to 4,
but 64 MB is quite a lot for a small installation :-(

Andreas




RE: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-21 Thread Mikheev, Vadim

 Correct me if I am wrong, but both cases do present a problem
 currently in 7.1.  The WAL log will not remove any WAL files
 for transactions that are still open (even after a checkpoint
 occurs). Thus if you do a bulk insert of gigabyte size you will
 require a gigabyte sized WAL directory. Also if you have a simple
 OLTP transaction that the user started and walked away from for
 his one week vacation, then no WAL log files can be deleted until
 that user returns from his vacation and ends his transaction.

Todo:

1. Compact log files after checkpoint (save records of uncommitted
   transactions and remove/archive others).
2. Abort long running transactions.

Vadim




AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Zeugswetter Andreas SB


   Isn't current implementation bulk delete ?
 
 No, the index AM is called separately for each index tuple to be
 deleted; more to the point, the search for deletable index tuples
 should be moved inside the index AM for performance reasons.

Wouldn't a sequential scan on the heap table be the fastest way to find
the keys that need to be deleted?

foreach tuple in heap that can be deleted do:
    foreach index
        call the current index delete with constructed key and xtid

The advantage would be that the current API would be sufficient, and
it should be faster. The problem would be creating a correct key from the heap
tuple that you can pass to the index delete function.
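
A rough sketch of that loop in C; every type and function below is a
placeholder standing in for the real heap and index access methods, not an
existing entry point:

#include <stdbool.h>
#include <stddef.h>

typedef struct { unsigned blk; unsigned off; } ItemPointer;   /* the "xtid" */
typedef struct { int dummy; } HeapTuple;
typedef struct { int dummy; } Relation;
typedef struct { int dummy; } IndexKey;

/* Placeholder: iterate over dead heap tuples found by the sequential scan. */
static bool
heap_next_dead(Relation *heap, HeapTuple *tup, ItemPointer *tid)
{
    (void) heap; (void) tup; (void) tid;
    return false;              /* stub: pretend there is nothing to clean */
}

/* Placeholder: the hard part, constructing the index key from the heap tuple. */
static IndexKey
build_key_from_heap(Relation *index, const HeapTuple *tup)
{
    (void) index; (void) tup;
    IndexKey key = {0};
    return key;
}

/* Placeholder: the existing per-tuple index delete. */
static void
index_delete_one(Relation *index, IndexKey key, ItemPointer tid)
{
    (void) index; (void) key; (void) tid;
}

/* One sequential pass over the heap; for every dead tuple, probe each index
 * directly with (constructed key, xtid) instead of scanning the whole index. */
static void
vacuum_indexes_per_tuple(Relation *heap, Relation **indexes, int nindexes)
{
    HeapTuple   tup;
    ItemPointer tid;

    while (heap_next_dead(heap, &tup, &tid))
    {
        for (int i = 0; i < nindexes; i++)
        {
            IndexKey key = build_key_from_heap(indexes[i], &tup);

            index_delete_one(indexes[i], key, tid);
        }
    }
}

int
main(void)
{
    vacuum_indexes_per_tuple(NULL, NULL, 0);
    return 0;
}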

Andreas




Re: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Zeugswetter Andreas SB  [EMAIL PROTECTED] writes:
 foreach tuple in heap that can be deleted do:
   foreach index
   call the current index delete with constructed key and xtid

See discussion with Hiroshi.  This is much more complex than TID-based
delete and would be faster only for small numbers of tuples.  (Very
small numbers of tuples, is my gut feeling, though there's no way to
prove that without implementations of both in hand.)

A particular point worth making is that in the common case where you've
updated the same row N times (without changing its index key), the above
approach has O(N^2) runtime.  The indexscan will find all N index tuples
matching the key ... only one of which is the one you are looking for on
this iteration of the outer loop.

regards, tom lane




AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Zeugswetter Andreas SB


 A particular point worth making is that in the common case where you've
 updated the same row N times (without changing its index key), the above
 approach has O(N^2) runtime.  The indexscan will find all N index tuples
 matching the key ... only one of which is the one you are looking for on
 this iteration of the outer loop.

It was my understanding that the heap xtid is part of the key now, and thus
with a somewhat modified access it would find the one exact row directly.
And in the above case the keys (since they are identical except for the xtid)
will stick close together, thus caching will be good.

Andreas




Re: AW: AW: [HACKERS] Plans for solving the VACUUM problem

2001-05-18 Thread Tom Lane

Zeugswetter Andreas SB  [EMAIL PROTECTED] writes:
 It was my understanding, that the heap xtid is part of the key now,

It is not.

There was some discussion of doing that, but it fell down on the little
problem that in normal index-search cases you *don't* know the heap tid
you are looking for.

 And in above case, the keys (since identical except xtid) will stick close 
 together, thus caching will be good.

Even without key-collision problems, deleting N tuples out of a total of
M index entries will require search costs like this:

bulk delete in linear scan way:

O(M)        I/O costs (read all the pages)
O(M log N)  CPU costs (look up each TID in sorted list)

successive index probe way:

O(N log M)  I/O costs for probing index
O(N log M)  CPU costs for probing index (key comparisons)

For N << M, the latter looks like a win, but you have to keep in mind
that the constant factors hidden by the O() notation are a lot different
in the two cases.  In particular, if there are T index entries per page,
the former I/O cost is really M/T * sequential read cost whereas the
latter is N log M * random read cost, yielding a difference in constant
factors of probably a thousand or two.  You get some benefit in the
latter case from caching the upper btree levels, but that's by
definition not a large part of the index bulk.  So where's the breakeven
point in reality?  I don't know but I suspect that it's at pretty small
N.  Certainly far less than one percent of the table, whereas I would
think that people would try to schedule VACUUMs at an interval where
they'd be reclaiming several percent of the table.
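
For what it's worth, a back-of-the-envelope version of that I/O comparison,
plugging in constants of the size mentioned above (purely illustrative
numbers, and it ignores the caching of upper btree levels):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double M   = 10e6;     /* total index entries */
    const double T   = 200.0;    /* index entries per page */
    const double seq = 1.0;      /* cost of one sequential page read */
    const double rnd = 1500.0;   /* random read ~1000-2000x a sequential one */

    /* bulk delete: read the whole index once, sequentially */
    double bulk_io = (M / T) * seq;

    /* per-tuple probing: roughly one descent of log2(M) random reads
     * per deleted tuple */
    double probe_io_per_tuple = log2(M) * rnd;

    double breakeven_n = bulk_io / probe_io_per_tuple;

    printf("bulk scan I/O cost:        %.0f\n", bulk_io);
    printf("probe I/O cost per tuple:  %.0f\n", probe_io_per_tuple);
    printf("breakeven at about N = %.1f tuples (%.5f%% of the index)\n",
           breakeven_n, 100.0 * breakeven_n / M);
    return 0;
}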

So, as I said to Hiroshi, this alternative looks to me like a possible
future refinement, not something we need to do in the first version.

regards, tom lane
