Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-24 Thread Manfred Koizar
On Thu, 22 Dec 2005 10:40:24 -0500, Tom Lane [EMAIL PROTECTED]
wrote:
If you move items
from one page to the other in the opposite direction from the way the
scan is going, then it will miss those items.

AFAIU the (PG implementation of the) L&Y method is designed to make
scans immune against problems caused by items moving right within the
same page and against page splits, i.e. items moving to a *new* right
sibling.  So making scans work with items moving to an *existing*
right sibling doesn't seem out of reach.

The code following this comment in _bt_restscan
/*
 * The item we're looking for moved right at least one page, so
 * move right.  We are careful here to pin and read-lock the next
 * non-dead page before releasing the current one.  This ensures
 * that a concurrent btbulkdelete scan cannot pass our position
 * --- if it did, it might be able to reach and delete our target
 * item before we can find it again.
 */
looks like it'd work for the case of page merging as well, as long as
we are careful to move items always from left to right.
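
A minimal, self-contained sketch of that move-right walk (the Page struct, half_dead flag, and move_right_find name are all invented stand-ins; the real routine operates on buffer-locked nbtree pages):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 8

typedef struct Page
{
    int         items[MAX_ITEMS];
    int         nitems;
    struct Page *right;         /* right sibling; NULL at end of the level */
    int         half_dead;      /* dead pages are skipped during the walk */
} Page;

/*
 * Walk right from 'start' until 'key' is found.  The real routine pins
 * and read-locks each next page before releasing the current one so a
 * concurrent btbulkdelete cannot overtake the scan; locking is elided
 * in this toy.
 */
static Page *
move_right_find(Page *start, int key)
{
    for (Page *p = start; p != NULL; p = p->right)
    {
        if (p->half_dead)
            continue;
        for (int i = 0; i < p->nitems; i++)
            if (p->items[i] == key)
                return p;
    }
    return NULL;
}
```

The same loop shape would serve after a merge, provided items only ever move rightward.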

BTW, if after having locked both pages we find that we have
super-exclusive locks, i.e. nobody else has pins on these pages, we
can reorganize much more aggressively.  It might even be safe to move
items to the left page.  The parent page might need some special
handling, though.

Servus
 Manfred

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-24 Thread Kevin Brown
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  We already do something similar for page deletions.  Empty pages are not
  deleted right away, but they are marked with BTP_DEAD, and then deleted
  on a subsequent vacuum.  Or something like that, I don't remember the
  exact details.
 
 Right, and the reason for that is exactly that there might be a
 concurrent indexscan already in flight to the newly-dead page.
 We must wait to recycle the page until we are certain no such scans
 remain.
 
 It doesn't matter whether a concurrent indexscan visits the dead
 page or not, *because it's empty* and so there's nothing to miss.
 So there's no race condition.  But if you try to move valid data
 across pages then there is a race condition.

Hmm...

Well, REINDEX is apparently a very expensive operation right now.  But
how expensive would it be to go through the entire index and perform
the index page merge operation being discussed here, and nothing else?

If it's fast enough, might it be worthwhile to implement just this
alone as a separate maintenance command (e.g., VACUUM INDEX) that
acquires the appropriate lock (AccessExclusive, I'd expect) on the
index to prevent exactly the issues you're concerned about?

If it's fast enough even on large tables, it would be a nice
alternative to REINDEX, I'd think.


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-24 Thread Tom Lane
Manfred Koizar [EMAIL PROTECTED] writes:
 BTW, if after having locked both pages we find that we have
 super-exclusive locks, i.e. nobody else has pins on these pages, we
 can reorganize much more aggressively.

No, we cannot.  I'm getting tired of explaining this.

regards, tom lane



Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-24 Thread Tom Lane
Kevin Brown [EMAIL PROTECTED] writes:
 Well, REINDEX is apparently a very expensive operation right now.  But
 how expensive would it be to go through the entire index and perform
 the index page merge operation being discussed here, and nothing else?
 If it's fast enough, might it be worthwhile to implement just this
 alone as a separate maintenance command (e.g., VACUUM INDEX) that
 acquires the appropriate lock (AccessExclusive, I'd expect) on the
 index to prevent exactly the issues you're concerned about?
 If it's fast enough even on large tables, it would be a nice
 alternative to REINDEX, I'd think.

This would work, but it's hard to tell if it'd be worthwhile short
of actually doing an implementation and field-testing it ...

regards, tom lane



Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-22 Thread Simon Riggs
On Wed, 2005-12-21 at 19:07 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  While we scan, if we found two adjacent pages, both of which have less
  than (say) 40% rows, we could re-join or unsplit those pages together.
 
 Curiously enough, this has been thought of before.  It is not as easy
 as you think, or it would have been done the first time around.
 nbtree/README explains why:
 
 : We consider deleting an entire page from the btree only when it's become
 : completely empty of items.  (Merging partly-full pages would allow better
 : space reuse, but it seems impractical to move existing data items left or
 : right to make this happen --- a scan moving in the opposite direction
 : might miss the items if so.  We could do it during VACUUM FULL, though.)

Sorry, I missed that. 

Seems impractical? Complex, yes. But I think it sounds a lot simpler
than the online REINDEX scheme... which is also the reason enhancing
VACUUM FULL doesn't appeal.

During the first VACUUM index scan, we can identify pages that are
candidates for reorganisation. Then during the second physical scan that
follows, we can reorg pages in the same manner as we delete them.

We identify a page during VACUUM for reorg if:
- it is < 20% full and we want to write this page
- the following page is < 20% full and we want to write this page
- it has a higher physical block number than the following page
That way we aren't super aggressive about reorg, but the cases we do
pick have a tendency to keep the index clustered. (Perhaps that could be
settable via an optional index parameter.) That definition also means we
have almost no additional I/O over what the VACUUM would have written
anyway.
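
Reading the three bullets above as a conjunction (my reading; they could also be meant as alternatives), the candidate test might look like this sketch, where PageInfo, fill_percent, and will_write are invented stand-ins for real page inspection:

```c
#include <assert.h>

typedef struct PageInfo
{
    unsigned    block_no;       /* physical block number */
    int         fill_percent;   /* 0..100: share of page holding live items */
    int         will_write;     /* VACUUM already intends to dirty this page */
} PageInfo;

/* Is 'page' worth merging with its logical successor 'next'? */
static int
merge_candidate(const PageInfo *page, const PageInfo *next)
{
    return page->fill_percent < 20
        && page->will_write
        && next->fill_percent < 20
        && page->block_no > next->block_no; /* moving rows left re-clusters */
}
```

The block-number condition is what gives the "keeps the index clustered" tendency: rows only ever migrate toward lower physical blocks.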

Page deletion requires us to lock left, lock right, lock above. That is
exactly the same if we have identified the page for reorg, since once we
have locked left and right, we just move rows to one or both of those
other blocks, then mark the page half-dead.

I know you've considered these things deeply, but my viewpoint is that
when what you have is already very good the only way forward consists of
considering seemingly foolish thoughts: backtracking.

Best Regards, Simon Riggs




Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-22 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Sorry, I missed that. 

And you evidently still didn't understand it.  Locking both pages does
not fix the problem, because it doesn't guarantee that there's not a
concurrent indexscan in flight from one to the other.  If you move items
from one page to the other in the opposite direction from the way the
scan is going, then it will miss those items.  If we try to fix this by
making scans lock one page before releasing the previous, then we'll
create a bunch of deadlock cases.
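
The hazard can be shown with a toy model (no real locking involved; all names invented): a scan visits the pages left to right, and between the two visits an item is moved right-to-left, i.e. against the scan direction.

```c
#include <assert.h>

#define MAX_ITEMS 4

typedef struct Page
{
    int items[MAX_ITEMS];
    int nitems;
} Page;

/* Append every item of 'p' to 'out'; returns the new count. */
static int
scan_page(const Page *p, int *out, int nout)
{
    for (int i = 0; i < p->nitems; i++)
        out[nout++] = p->items[i];
    return nout;
}

/* Move the last item of 'src' onto 'dst' (a right-to-left move). */
static void
move_item_left(Page *dst, Page *src)
{
    dst->items[dst->nitems++] = src->items[--src->nitems];
}

/*
 * The scan reads page A, then page B.  If an item moves from B back to
 * A after A was read, the scan never returns it.  Returns 1 if the
 * moved item (key 4) was indeed missed.
 */
static int
racy_scan_misses_item(void)
{
    Page a = {{1, 2}, 2};
    Page b = {{3, 4}, 2};
    int seen[2 * MAX_ITEMS];
    int n = 0;

    n = scan_page(&a, seen, n);     /* scan has finished with A ... */
    move_item_left(&a, &b);         /* ... key 4 moves from B to A ... */
    n = scan_page(&b, seen, n);     /* ... so the scan of B misses it */

    for (int i = 0; i < n; i++)
        if (seen[i] == 4)
            return 0;
    return 1;
}
```

Locking both pages during the move does not help, because the scan in this interleaving holds no lock at the moment of the move; it has already released page A.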

regards, tom lane



Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-22 Thread Martijn van Oosterhout
On Thu, Dec 22, 2005 at 10:40:24AM -0500, Tom Lane wrote:
 And you evidently still didn't understand it.  Locking both pages does
 not fix the problem, because it doesn't guarantee that there's not a
 concurrent indexscan in flight from one to the other.  If you move items
 from one page to the other in the opposite direction from the way the
 scan is going, then it will miss those items.  If we try to fix this by
 making scans lock one page before releasing the previous, then we'll
 create a bunch of deadlock cases.

I've been thinking: maybe there's a lockless way of going about this?
Have some kind of index modification identifier (IMI) that you set at the
beginning of the index scan. What you do is mark the tuples you want to
move with an IMI (I'm trying to avoid the word transaction here) and then
copy the tuples to the new page with IMI+1. Any currently running index
scan will notice the higher IMI and ignore them. When all old index
scans are done, you can remove the old copies.

It's sort of a mini-MVCC for indexes, except you could probably just use
a few states: visible to all, visible to current scan, invisible to
current scan. Or use two bits to represent frozen, 1, 2 and 3. A plain
VACUUM could do the state transitions each time it moves through the
index.
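
As a toy encoding of that idea (two bits per item; the enum and function names are invented): an item stamped with a modification identifier newer than the one the scan started under is simply ignored by that scan, while frozen items are visible to everyone.

```c
#include <assert.h>

typedef enum ItemState
{
    IMI_FROZEN = 0,             /* visible to every scan */
    IMI_1 = 1,
    IMI_2 = 2,
    IMI_3 = 3
} ItemState;

/*
 * A scan notes the current IMI when it starts.  Tuples about to be
 * moved keep the current IMI; their relocated copies get IMI+1, so an
 * in-flight scan keeps seeing the old copies and ignores the new ones.
 * (Wrap-around of the two-bit counter is ignored in this toy.)
 */
static int
item_visible(ItemState item_imi, ItemState scan_imi)
{
    return item_imi == IMI_FROZEN || item_imi <= scan_imi;
}
```
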

The downsides are probably that it's a pain to make it work
concurrently and requires writing each index page at least twice. But
it's a thought...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-22 Thread Alvaro Herrera
Martijn van Oosterhout wrote:

 The downsides are probably that it's a pain to make it work
 concurrently and requires writing each index page at least twice. But
 it's a thought...

We already do something similar for page deletions.  Empty pages are not
deleted right away, but they are marked with BTP_DEAD, and then deleted
on a subsequent vacuum.  Or something like that, I don't remember the
exact details.

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-22 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 We already do something similar for page deletions.  Empty pages are not
 deleted right away, but they are marked with BTP_DEAD, and then deleted
 on a subsequent vacuum.  Or something like that, I don't remember the
 exact details.

Right, and the reason for that is exactly that there might be a
concurrent indexscan already in flight to the newly-dead page.
We must wait to recycle the page until we are certain no such scans
remain.

It doesn't matter whether a concurrent indexscan visits the dead
page or not, *because it's empty* and so there's nothing to miss.
So there's no race condition.  But if you try to move valid data
across pages then there is a race condition.

regards, tom lane



Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-22 Thread Simon Riggs
On Thu, 2005-12-22 at 10:40 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Sorry, I missed that. 
 
 And you evidently still didn't understand it.  Locking both pages does
 not fix the problem, because it doesn't guarantee that there's not a
 concurrent indexscan in flight from one to the other.  If you move items
 from one page to the other in the opposite direction from the way the
 scan is going, then it will miss those items.  If we try to fix this by
 making scans lock one page before releasing the previous, then we'll
 create a bunch of deadlock cases.

Thank you for explaining things; I'm sorry you had to do it twice.

I'm ever optimistic, so I try to resist the temptation to stay quiet in
case I make a mistake.

Best Regards, Simon Riggs





[HACKERS] Unsplitting btree index leaf pages

2005-12-21 Thread Simon Riggs
When we discussed online REINDEX recently we focused on the REINDEX
command itself rather than looking at alternative approaches.

One reason to REINDEX is that index page splits get things out of
sequence and generally bloat the index.

When we VACUUM, each index is scanned in logical order.

While we scan, if we found two adjacent pages, both of which have less
than (say) 40% rows, we could re-join or unsplit those pages together.
The index blocks are fully locked during the read anyway and there is no
MVCC problem with moving index rows between blocks. All we have to do is
lock both blocks, taking them in the correct order.

The rows would always be moved to the lowest physical block id, so that
data would naturally migrate towards the start of the index file. Blocks
would then be marked half-dead, just as if their last index row had been
removed by the vacuum.

We could start the scan by locking block 1 AND block 2, then scan forward,
always holding two locks as we go. That way we would not need to unlock
and relock in order to take the second lock. The concurrency loss would
not be that great, and we would gain the ability to unsplit the two
blocks into one.
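
A sketch of the unsplit step itself, assuming both blocks are already exclusively locked and ignoring the parent-level fixup; the struct and names are invented, and key ordering within a page is glossed over in this toy:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 8

typedef struct Page
{
    int         items[MAX_ITEMS];
    int         nitems;
    unsigned    block_no;       /* physical block number */
    int         half_dead;      /* emptied page, reclaimed by a later vacuum */
} Page;

/*
 * Merge two adjacent under-full pages, moving every row into the page
 * with the lower physical block number so data migrates toward the
 * start of the index file.  Returns 0 if the rows would not fit.
 */
static int
unsplit(Page *a, Page *b)
{
    Page *dest = (a->block_no < b->block_no) ? a : b;
    Page *src = (dest == a) ? b : a;

    if (dest->nitems + src->nitems > MAX_ITEMS)
        return 0;
    for (int i = 0; i < src->nitems; i++)
        dest->items[dest->nitems++] = src->items[i];
    src->nitems = 0;
    src->half_dead = 1;         /* reclaimed later, like any emptied page */
    return 1;
}
```
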

If we do this, we could possibly avoid the need for a full REINDEX
entirely.

If this method checks out we could do one of:
- make VACUUM do this always
- add an option to REINDEX (e.g. CLEAN/COMPRESS/VACUUM) to do this on
explicit command only

Best Regards, Simon Riggs




Re: [HACKERS] Unsplitting btree index leaf pages

2005-12-21 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 While we scan, if we found two adjacent pages, both of which have less
 than (say) 40% rows, we could re-join or unsplit those pages together.

Curiously enough, this has been thought of before.  It is not as easy
as you think, or it would have been done the first time around.
nbtree/README explains why:

: We consider deleting an entire page from the btree only when it's become
: completely empty of items.  (Merging partly-full pages would allow better
: space reuse, but it seems impractical to move existing data items left or
: right to make this happen --- a scan moving in the opposite direction
: might miss the items if so.  We could do it during VACUUM FULL, though.)

regards, tom lane
