Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Jeffrey Baker
Tom Lane wrote:
Jeffrey W. Baker [EMAIL PROTECTED] writes:
I see that Tom has already done the infrastructure work by adding
getmulti, but getmulti isn't used by nodeIndexscan.c, only
nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
in-memory bitmaps in 8.1, or will some scans still be executed the usual
way?

We aren't going to remove the existing indexscan behavior, because
bitmap scans lose the ordering of the underlying index.  There are many
situations where that ordering is important.  (See for instance the
recent changes to make MAX/MIN use that behavior.)
Would you take a patch that retained the optimized executions of plans 
returning 1 tuple and also fixed the random heap problem?

-jwb
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Neil Conway
Jeffrey Baker wrote:
Would you take a patch that retained the optimized executions of plans 
returning 1 tuple and also fixed the random heap problem?
Can you elaborate on what you're proposing? Obviously sorted b+-tree 
output is important for a lot more than just min()/max(). I don't see an 
obvious way to produce sorted output from a bitmap tree index scan 
without requiring an additional sort step (which would be rather 
pointless -- the whole point of the optimization is to avoid an 
additional sort).

-Neil
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Jeffrey Baker
Neil Conway wrote:
Jeffrey Baker wrote:
Would you take a patch that retained the optimized executions of plans 
returning 1 tuple and also fixed the random heap problem?

Can you elaborate on what you're proposing? Obviously sorted b+-tree 
output is important for a lot more than just min()/max(). I don't see an 
obvious way to produce sorted output from a bitmap tree index scan 
without requiring an additional sort step (which would be rather 
pointless -- the whole point of the optimization is to avoid an 
additional sort).
I understand the importance of returning tuples in index order for many 
plans (although I probably haven't thought of all the cases.  min/max is 
the most obvious; order by + limit is another).  The only problem I'm 
trying to solve is when an indexscan returns a large result, causing the 
heap to be visited in index order, which is to say random order, from 
the disk's perspective.  When I investigated this last year, sorting the 
intermediate result of the index scan in disk order was good for a 
reduction by two-thirds in actual execution time, and sorting the scan 
result in chunks of 1000 tuples was enough to reduce the time by half.

I'm considering one of the following courses of action:
Change nodeIndexscan.c to call index_getmulti, and to handle multiple 
tuples returned.  That code would sort the tuple array and store the 
tuples in the result in disk order.

-or-
Change the planner/executor to use the bitmap scan in all cases where 
index order is unimportant.  From my reading of the current code, the 
bitmap scan is only used in case of a join.

-jwb
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Jeffrey W. Baker
On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
 Jeffrey Baker [EMAIL PROTECTED] writes:
  Change the planner/executor to use the bitmap scan in all cases where 
  index order is unimportant.  From my reading of the current code, the 
  bitmap scan is only used in case of a join.
 
 This is a fallacy, and I think your concern is largely mistaken.  Have
 you experimented with the cases you are worried about?

Perhaps I have not stated the problem clearly.  Believe me, I have
experimented extensively with this problem.  So here's the statement of
the problem: during an index scan, the heap is visited in index order,
which can be and frequently is random order on the disk.  An index scan
that currently takes 15 minutes on my database takes only 5 when the
tuples are fetched in strict disk order, and takes 8 minutes if the
tuples are fetched in disk order in groups of 1000.  The problem exists
when the scan is expected to return a lot of tuples, but the planner
chooses an index scan anyway.  In many cases, sequential scan would be
faster than index scan + random heap visits.

 It's entirely possible that the current cost model needs adjustment to
 make the planner pick the bitmap scan in more cases.  However, it is
 easy to demonstrate (either by thought-experiment or actual trial) that
 a bitmap scan isn't necessarily a win.

I agree.  There's certainly a threshold below which the bitmap would not
make sense.  It's also possible that changing the btree scan to work in
groups of tuples instead of single tuples would make more sense, which
is why I ventured two different solution to the problem.

  The overhead of maintaining the
 bitmap is far from zero, and you don't get anything unless the resulting
 pattern of heap page fetches is significantly cleaner than before.

Bitmaps scans naturally come back in TID order.  I realize this isn't
1:1 correspondence with disk order, but it's a good start.  I also like
the way the index scan and heap scan are decoupled in the bitmap code.
It leaves room for more serious optimization of disk access, which is
relatively difficult in the synchronous, one-at-a-time btree code.

 Therefore, a patch that eliminates cost-estimation in favor of trying to
 force one choice or the other is definitely not likely to be received
 favorably ...

That's not the idea.

-jwb



---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Hannu Krosing
On P, 2005-05-15 at 23:58 -0700, Jeffrey Baker wrote:

 I'm considering one of the following courses of action:
 
 Change nodeIndexscan.c to call index_getmulti, and to handle multiple 
 tuples returned.  That code would sort the tuple array and store the 
 tuples in the result in disk order.
 
 -or-
 
 Change the planner/executor to use the bitmap scan in all cases where 
 index order is unimportant.  From my reading of the current code, the 
 bitmap scan is only used in case of a join.

I think the 2nd option is probably the way to go.  Probably not _all_
cases - it's probably cheper to not build a bitmap for cases where the
index returns only a few (or just one) rows.

OTOH, some scheme, where you fill sort_mem-sized memory structure with
tuples, which are fetched in heap order but stored in that structure in
index order could also be an interesting optimisastion for sort-order
preserving btree-index scans. this could also be used for bigger sort as
first step by saving each chunk to disk and then merging them back into
result.

in pseudocode one iteretion of this could go like this

TIDARRAY = ARRAY[1000]
I = 0
FOR TID in FETC_FROM_BTREE(0, 1000):
ARRAY [I] := (TID, I)
SORT_ARRAY_ON_TID()# this must be much faster than sorting
   # whole tuples for this method to be a win
   # over bitmap_scan + sort.
   # using some self-ordering structure like
   # btree/rbtree for TIDARRAY is also an option
OUTARRAY = ARRAY[1000]
FOR (TID, I) IN TIDARRAY:
OUTARRAY[I] = FETCH_FROM_HEAP(TID)
RETURN OUTARRAY

This can probably not use bitmap scan code, but has the nice property of
doing the disk accesses in file order but still returning the result in
index order.

-- 
Hannu Krosing [EMAIL PROTECTED]


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Tom Lane
Jeffrey W. Baker [EMAIL PROTECTED] writes:
 On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
 This is a fallacy, and I think your concern is largely mistaken.  Have
 you experimented with the cases you are worried about?

 Perhaps I have not stated the problem clearly.  Believe me, I have
 experimented extensively with this problem.

Sorry, perhaps I wasn't clear: have you experimented *with CVS tip*?
The current code is certainly capable of choosing either seqscan,
bitmap scan, or traditional index scan depending on the given query.
For example,

regression=# explain analyze select * from tenk1 where unique1 between 100 and 
1000;
QUERY PLAN
--
 Bitmap Heap Scan on tenk1  (cost=9.58..381.53 rows=930 width=244) (actual 
time=6.185..18.034 rows=901 loops=1)
   Recheck Cond: ((unique1 = 100) AND (unique1 = 1000))
   -  Bitmap Index Scan on tenk1_unique1  (cost=0.00..9.58 rows=930 width=0) 
(actual time=4.522..4.522 rows=901 loops=1)
 Index Cond: ((unique1 = 100) AND (unique1 = 1000))
 Total runtime: 23.784 ms
(5 rows)

regression=# explain analyze select * from tenk1 where unique2 between 100 and 
1000;
 QUERY PLAN
-
 Index Scan using tenk1_unique2 on tenk1  (cost=0.00..45.88 rows=805 width=244) 
(actual time=0.154..14.856 rows=901 loops=1)
   Index Cond: ((unique2 = 100) AND (unique2 = 1000))
 Total runtime: 20.331 ms
(3 rows)

(The reason these apparently-identical queries get different plans is
that the unique2 column is physically ordered, so the plain indexscan
gets a very high correlation discount.)  There are enable_foo switches
to let you force selection of any one of the three ways for testing
purposes.

 It's also possible that changing the btree scan to work in
 groups of tuples instead of single tuples would make more sense, which
 is why I ventured two different solution to the problem.

My feeling is that that would add a lot of complexity for dubious win.
The reason it's dubious is that it would penalize some cases, for
instance LIMIT-type queries where you aren't going to fetch many tuples
anyway.  I think that at least for the time being (8.1 time frame) we
should leave traditional indexscans alone and concentrate on being sure
we are getting the most we can out of the new bitmap scan code.  Only
after that dust has settled will we have hard facts with which to argue
about whether compromise behaviors might be wins.

I think the work that's most needed at the moment is to test the
bitmap-scan cost model to see if it has much to do with reality...

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Jeffrey W. Baker
On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote:
 Jeffrey W. Baker [EMAIL PROTECTED] writes:
  It's also possible that changing the btree scan to work in
  groups of tuples instead of single tuples would make more sense, which
  is why I ventured two different solution to the problem.
 
 My feeling is that that would add a lot of complexity for dubious win.
 The reason it's dubious is that it would penalize some cases, for
 instance LIMIT-type queries where you aren't going to fetch many tuples
 anyway.  I think that at least for the time being (8.1 time frame) we
 should leave traditional indexscans alone and concentrate on being sure
 we are getting the most we can out of the new bitmap scan code.  Only
 after that dust has settled will we have hard facts with which to argue
 about whether compromise behaviors might be wins.

I agree.  I'll look at how my workload behaves with CVS code.  I wasn't
proposing this for 8.1 inclusion, and the TODO isn't marked for 8.1.

 I think the work that's most needed at the moment is to test the
 bitmap-scan cost model to see if it has much to do with reality...

Alright.

-jwb

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Jeffrey W. Baker [EMAIL PROTECTED] writes:
  I see that Tom has already done the infrastructure work by adding
  getmulti, but getmulti isn't used by nodeIndexscan.c, only
  nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
  in-memory bitmaps in 8.1, or will some scans still be executed the usual
  way?
 
 We aren't going to remove the existing indexscan behavior, because
 bitmap scans lose the ordering of the underlying index.  There are many
 situations where that ordering is important.  (See for instance the
 recent changes to make MAX/MIN use that behavior.)

Hm. There are other circumstances where the ordering doesn't matter. When
there's another unrelated ORDER BY clause or merge join wrapped around the
index scan for example.

This suggests one new 8.1 optimization strategy may be to add strategic no-op
OR clauses to cause 8.1 to use a bitmapOr node.

For example something like this where flag isn't very selective (say 25%)
might run more slowly than a sequential scan because of the random access
pattern.

SELECT id,name,flag
  FROM tab
 WHERE flag
 ORDER BY name

But adding a no-op bitmapOr node like:

SELECT id,name,flag
  FROM tab
 WHERE flag
OR indexed_never_true_flag
 ORDER BY name

Might run faster, perhaps even more quickly than the sequential scan because
the bitmap avoids the random access pattern but doesn't have to read the whole
table.

-- 
greg


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Mike Rylander
On 5/16/05, Tom Lane [EMAIL PROTECTED] wrote:
 regression=# explain analyze select * from tenk1 where unique1 between 100 
 and 1000;
 QUERY PLAN
 --
  Bitmap Heap Scan on tenk1  (cost=9.58..381.53 rows=930 width=244) (actual 
 time=6.185..18.034 rows=901 loops=1)
Recheck Cond: ((unique1 = 100) AND (unique1 = 1000))
-  Bitmap Index Scan on tenk1_unique1  (cost=0.00..9.58 rows=930 width=0) 
 (actual time=4.522..4.522 rows=901 loops=1)
  Index Cond: ((unique1 = 100) AND (unique1 = 1000))
  Total runtime: 23.784 ms
 (5 rows)
 
 regression=# explain analyze select * from tenk1 where unique2 between 100 
 and 1000;
  QUERY PLAN
 -
  Index Scan using tenk1_unique2 on tenk1  (cost=0.00..45.88 rows=805 
 width=244) (actual time=0.154..14.856 rows=901 loops=1)
Index Cond: ((unique2 = 100) AND (unique2 = 1000))
  Total runtime: 20.331 ms
 (3 rows)
 

Tom (or anyone with some round tuits and CVS-tip savy), if you have a
chance at some point would you post the non-bitmap version of the
query for tenk2 from above?  I'd be very interested to see if the
bitmap forced TID order fetch actually does help, or if it's
outweighed by the bitmap setup overhead.

TIA

-- 
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-16 Thread Tom Lane
Jeffrey Baker [EMAIL PROTECTED] writes:
 Change the planner/executor to use the bitmap scan in all cases where 
 index order is unimportant.  From my reading of the current code, the 
 bitmap scan is only used in case of a join.

This is a fallacy, and I think your concern is largely mistaken.  Have
you experimented with the cases you are worried about?

It's entirely possible that the current cost model needs adjustment to
make the planner pick the bitmap scan in more cases.  However, it is
easy to demonstrate (either by thought-experiment or actual trial) that
a bitmap scan isn't necessarily a win.  The overhead of maintaining the
bitmap is far from zero, and you don't get anything unless the resulting
pattern of heap page fetches is significantly cleaner than before.
Therefore, a patch that eliminates cost-estimation in favor of trying to
force one choice or the other is definitely not likely to be received
favorably ...

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] bitmap scans, btree scans, and tid order

2005-05-15 Thread Jeffrey W. Baker
About this time last year I was holding forth on the value of visiting
the heap in TID order, even when index scan tuples are randomly ordered.
Today I decided to start working on the problem stated in this TODO
item:

Fetch heap pages matching index entries in sequential order 

Rather than randomly accessing heap pages based on index
entries, mark heap pages needing access in a bitmap and do the
lookups in sequential order. Another method would be to sort
heap ctids matching the index before accessing the heap rows.

I see that Tom has already done the infrastructure work by adding
getmulti, but getmulti isn't used by nodeIndexscan.c, only
nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
in-memory bitmaps in 8.1, or will some scans still be executed the usual
way?  If the former, I'd be wasting time, but in the latter case it
would be worth it.

-jwb

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] bitmap scans, btree scans, and tid order

2005-05-15 Thread Tom Lane
Jeffrey W. Baker [EMAIL PROTECTED] writes:
 I see that Tom has already done the infrastructure work by adding
 getmulti, but getmulti isn't used by nodeIndexscan.c, only
 nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
 in-memory bitmaps in 8.1, or will some scans still be executed the usual
 way?

We aren't going to remove the existing indexscan behavior, because
bitmap scans lose the ordering of the underlying index.  There are many
situations where that ordering is important.  (See for instance the
recent changes to make MAX/MIN use that behavior.)

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]