[HACKERS] Index scan ordering (performance)

2004-02-18 Thread Glen Parker
Hi all,

I'm looking at ways to improve PG performance where index scans are
concerned.  The layout of our database and our access habits really hit PG's
index code square in the chops.  Many of our queries spend way too much time
waiting for disk seeks, and I believe this is caused entirely by the way PG
does index scans.

In the TODO list is an entry, "Order duplicate index entries by tid for
faster heap lookups."  I think this change would prove *very* beneficial for
us.  Is anybody working on this?  Is it considered important by many people?

Ordering duplicate index entries would obviously help for...  uh...
duplicate entries, but what about non-duplicates being accessed in a range
scan?  If the executor were to handle the ordering at read time, it would
work for ranges as well.  Would it be wildly difficult to have the executor
scan more than one index entry before doing the related heap accesses?  That
way you could read X number of index entries (or all of them), order the
resulting list on tid, and then go after the heap.  It seems that the only
side effect from this would be tid-ordered default sorting rather than
index-ordered default sorting (which does not affect SQL compliance).
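
A minimal standalone sketch of the idea (hypothetical types and names, not
the backend's actual executor code): collect a batch of TIDs in index order,
sort the batch by physical heap location, and only then touch the heap.

#include <stdio.h>
#include <stdlib.h>

typedef struct Tid
{
    unsigned    block;          /* heap block number */
    unsigned    offset;         /* line pointer within the block */
} Tid;

static int
tid_physical_cmp(const void *a, const void *b)
{
    const Tid  *ta = (const Tid *) a;
    const Tid  *tb = (const Tid *) b;

    if (ta->block != tb->block)
        return (ta->block < tb->block) ? -1 : 1;
    if (ta->offset != tb->offset)
        return (ta->offset < tb->offset) ? -1 : 1;
    return 0;
}

int
main(void)
{
    /* pretend these came back from an index range scan, in key order */
    Tid         batch[] = {{9731, 4}, {12, 7}, {9731, 1}, {403, 2}, {12, 3}};
    size_t      n = sizeof(batch) / sizeof(batch[0]);

    /* reorder the batch so the heap is visited front to back */
    qsort(batch, n, sizeof(Tid), tid_physical_cmp);

    for (size_t i = 0; i < n; i++)
        printf("fetch heap block %u, item %u\n",
               batch[i].block, batch[i].offset);
    return 0;
}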

Any idea what the status is on this stuff?  Should I dive in and try to do it
myself?

Glen Parker
[EMAIL PROTECTED]



FW: [HACKERS] Do we prefer software that works or software that looks good?

2004-04-26 Thread Glen Parker
Tom Lane wrote:

Personally I don't think that this is a transitional issue and we will
someday all be happy in upper-case-only-land.  Upper-case-only sucks,
by every known measure of readability, and I don't want to have to put up
with a database that forces that 1960s-vintage-hardware mindset on me.

I think the SQL standard is screwy here on at least two levels.  
Not only is upper case fuggly (we all seem to agree on that 
point), but I think case folding is a Bad Idea in general.  I 
think the only time you should have to quote a DB identifier is 
when it conflicts with a reserved word.  Either be case sensitive
or don't.  I'm all for the "ignore but preserve case" way of doing things.

But it IS the standard, and as such, as much as we all seem to 
dislike it, I believe it is better to follow it.  You can't just 
go around picking and choosing the standards you'll adhere to.  
Like Microsoft.  If it bothers you that much, put some effort 
into changing it.  Attain world domination and then force the 
world to bend to The Right Way.  Get rich and pay off enough 
members of the standards body to get it changed.  But until then, 
live with it.

Now, I am all for configurability, and lots of it.  By all means, 
allow us to choose how we'd like case folding to be carried out, 
or whether case folding (blech) is done at all.  While you're at 
it, allow us to choose whether NULL is treated as
zero/blank/empty or as SQL-standard NULL.  Allow us to force the
DB to do case-insensitive comparisons on all character data.
Allow us, as DB admins, to f*** with the standard behavior until 
we have a working mimic of MySQL or MS-SQL :-)

But I think the default behavior should adhere to the SQL 
standard as closely as possible, even when we all hate it with a passion.

Just my $.02
Glen Parker



Re: [HACKERS] PITR Phase 1 - Test results

2004-04-26 Thread Glen Parker
 I want to come hug you --- where do you live?  !!!

You're not the only one.  But we don't want to smother the poor guy, at
least not before he completes his work :-)



Re: [HACKERS] proposal: be smarter about i/o patterns in index scan

2004-05-18 Thread Glen Parker
 The basic problem is that Pg seeks far too much when performing an index
 scan.  I have saved an strace of a backend which is selecting 170,000
 rows from a 240,000,000 row table via index scan.  The strace shows
 134,000 seeks and reads, or almost one seek for every tuple in the
 result set.  This would be fine normally except the seeks are in a
 *very* suboptimal pattern, swinging back and forth across the device.
 The query requires 16 minutes, 30 seconds on our test hardware.

Yes yes yes *please* fix this :-)  This performance bottleneck in PG
affects us all the time.


 The proposal is to sort the block requests before they are issued.
 Because Pg only issues single seek/read pairs synchronously, the kernel
 has no chance to apply its elevator and hence every seek/read Pg issues
 results in actual movement of the disk head.  Pg's random pattern of
 seeking also makes the kernel's readahead fairly useless.

 To prove the proposal I did a simulation, using the recorded seek
 positions in the strace.  We noted that Pg issued 403 seek/read pairs
 for every 8192 bytes read from the index.  So we performed four
 simulations: a straight playback of Pg's seek pattern, a playback where
 each list of 403 seeks is sorted by byte offset, a playback of all the
 seeks sorted by offset, and a sequential scan of the 13GB data file.

 PostgreSQL 7.4.2:  16m30s
 Sorted in chunks:  10m6s
 Sorted altogether: 4m19s
 Sequential scan:   6m7s

 As you can see, there's a lot to be gained here.  If Pg was able to
 issue its seeks in the optimal order, the query would return in 1/4th
 the time.  Even the chunk-sorted scheme is a big win.
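
For reference, a cut-down playback harness in the spirit of the simulation
above might look like this (hypothetical code, not the original test
program; the offsets stand in for positions captured from the strace):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int
off_cmp(const void *a, const void *b)
{
    off_t       oa = *(const off_t *) a;
    off_t       ob = *(const off_t *) b;

    return (oa > ob) - (oa < ob);
}

int
main(int argc, char **argv)
{
    /* stand-ins for the seek targets recorded from the backend */
    off_t       offsets[] = {811597824, 16384, 530579456, 8192, 104857600};
    size_t      n = sizeof(offsets) / sizeof(offsets[0]);
    char        buf[8192];
    int         fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s datafile [--sorted]\n", argv[0]);
        return 1;
    }

    /* with --sorted, apply the "elevator" ourselves before issuing I/O */
    if (argc > 2)
        qsort(offsets, n, sizeof(off_t), off_cmp);

    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* one 8K read per recorded position, like the traced backend */
    for (size_t i = 0; i < n; i++)
        if (pread(fd, buf, sizeof(buf), offsets[i]) < 0)
            perror("pread");

    close(fd);
    return 0;
}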

I think your worst case may not be the worst it gets.  Nothing scientific,
but my experience tells me that it's common to get even worse performance
than that.  I've had queries that would take seemingly forever, but then
after a fresh cluster and cache flush, the same query would take almost no
time at all.

I also think that with a multi-mirrored RAID setup and your proposed IO
sorting, the mirroring would be more likely to overcome seek overhead.  With
the current implementation, it seems almost useless to throw more hardware
at the problem.

I think the improvement would be even 'huger' than your numbers show :-)


 So the proposal is this: the offsets stored in the index ought to be
 sorted.  Either A) they can be stored sorted in the first place (sorted
 in VACUUM ANALYZE, or always sorted), or B) the backend can sort each
 list of tuples it reads from the index, or C) the backend can read the
 entire index result and sort that (this would be the best).

 From just paging through the code, it doesn't seem terribly hard.  B
 seems the easiest to implement but is also the least improvement.  Even
 seqscan is better than B.  One improvement to B would be to read much
 more than 8K from the index.  Reading e.g. 256KB would improve things
 dramatically.  A might be easy but would either degrade over time or
 cause slower inserts.  C is the best and hardest to implement (I am
 guessing, because the size of the sorted index subset is unbounded).

IMO if you're going to do it, do it right.  That means A is pretty much out.
It would seem that B (improved) would be the way to go because, as you say,
C would produce unbounded subsets.  I would propose to read very large
sections of the index subset though, more on the order of several MB
(configurable, of course).  With that much space to play with, most queries
would require only one pass at the index and therefore show performance
equal to C.  Larger queries would suffer a bit, but then we don't usually
have such speedy expectations of large result sets, and again, with
enough sort space to play with, the performance could approach that of C.
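
As a back-of-the-envelope check (assuming PostgreSQL's 6-byte on-disk TID,
i.e. sizeof(ItemPointerData)), even a small sort space holds a lot of TIDs;
the 170,000-row example above would fit in about 1 MB:

#include <stdio.h>

int
main(void)
{
    const double tid_bytes = 6.0;       /* sizeof(ItemPointerData) */
    const double mb[] = {1, 4, 16};     /* candidate sort-space sizes */

    for (int i = 0; i < 3; i++)
        printf("%4.0f MB of sort space -> ~%.0f TIDs per pass\n",
               mb[i], mb[i] * 1024 * 1024 / tid_bytes);
    return 0;
}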

I would think that you could also physically sort the index into index order
(during vacuum) to improve index scan performance.  This could be a completely
separate implementation task.  With clean sequential index access *and*
clean sequential heap access, it might just prove to be the single largest
performance boost PG has seen, well, ever.

My .02...

Glen Parker



Re: [HACKERS] proposal: be smarter about i/o patterns in index scan

2004-05-19 Thread Glen Parker
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]] On Behalf Of Tom Lane

 For starters, read the previous discussions of this in the archives.

Search doesn't work.  Even if it did, I'm not sure what I would search on.
Do you remember the time frame of the discussion?  If I could narrow it down
to a few months, I could possibly find it the slow way.  I'd be very
interested to read it.

Anyway, this subject seems to get precious little attention, and I don't
understand why.  PG's index scan implementation is so bad at some types of
queries, I find it surprising that there aren't weekly discussions about it,
and developers lined up around the block to work on it.  I would be one of
those developers, but frankly the learning curve is pretty steep and I don't
have any smaller complaints to use as learning tools.

What am I missing?  Why is a performance bottleneck of this magnitude not
on the same list of priorities as PITR, replication, and Win32?

 Two questions you should have answers to before starting to implement,
 rather than after, are how to deal with locking considerations and
 what will be the implications of giving up the property that indexscans
 deliver sorted output.

Here's one answer: even if you had to sort every result set, including those
where an index scan could otherwise have supplied the ordering, overall
performance would still improve by a very large margin.  I'd bet money on it.

Actually, the index would still have to be scanned in sort order.  Only the
fetch order for heap tuples would be sorted differently.


Glen Parker



Re: [HACKERS] proposal: be smarter about i/o patterns in index scan

2004-05-19 Thread Glen Parker
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]] On Behalf Of Tom Lane
 Sent: Wednesday, May 19, 2004 11:56 AM

 Not unless you add yet another sort step after the fetch step, that is
 the idea becomes
   1. read index to get candidate TIDs
   2. sort TIDs into physical order
   3. read heap in physical order, check row visibility
   4. sort selected rows into index order

 That would start to look like an unpleasant amount of overhead, since
 the sort would have to be over the entire output; you couldn't divide
 the scan into reasonable-size batches, whereas steps 1-3 can easily
 be run in batched style.

Or:

  2. Sort AND COPY TIDs into physical order.
  3. Read heap in physical order, match results to the un-sorted TID list.

That sounds quite cheap.

Then you get to drop step 4 (which would *usually* be quite a bit more
expensive than a TID sort and copy?).

The cost of the proposed implementation would be higher *when everything is
in the cache*, granted.  As a user, I will take that cost 10 times over in
return for such a large improvement in the no-cache situation.

-Glen



Re: [HACKERS] proposal: be smarter about i/o patterns in index scan

2004-05-19 Thread Glen Parker
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED]] On Behalf Of Tom Lane

  Or:
    2. Sort AND COPY TIDs into physical order.
    3. Read heap in physical order, match results to the un-sorted TID list.
  That sounds quite cheap.

 No, it sounds like handwaving.  What's your "match results" step, and

You got me :-)

 does it take less than O(N^2) time?  How do you get to *return* the

Less than O(N^2)???  Geez I think so!

 tuples in index order, without having read them all before you return
 the first one?  (Which is really what the problem is with the sort
 alternative, at least for fast-start cases such as the LIMIT 1 example.)

(You'll have to pardon me for thinking as I type... and being too long-winded
:-)

It takes O(1) time per match.

If you have a hard maximum of TIDs you'll grab from the index (which sounds
right), you could store the TID list in a vector.  The IO-order-sorted list
could be a singly-linked list OR a vector, but either way, each entry would
have to point back to an offset in the index-ordered list.

The index-ordered list would be (sizeof(TID) * max_TID_count) bytes, and the
IO-ordered list would be ((sizeof(TID) + sizeof(int)) * max_TID_count) bytes.
If a TID is 8 bytes and we're going to fetch 2048 index entries per
iteration, that gets us (*counting on fingers*) 16K (for the index-ordered
list) + 24K for the IO-ordered list.

The index-ordered list can also be a singly-linked list, in which case the
mapping would be by pointer.  If both lists are singly-linked lists, then
the size overhead (for a fully realized 2048-entry scan) would be:
((sizeof(TID) + sizeof(void*)) * max_TID_count)  +  ((sizeof(TID) +
sizeof(void*) + sizeof(void*)) * max_TID_count) = (*more counting on
fingers*) ... 24K + 32K = 56K.

Hmm, the vector style would be crazy expensive for small queries.  The
linked-list approach sounds pretty good though.  I assume no one thinks
doing this would be completely free :-)  And, if you did the IO-ordered list
as a vector, you'd save the linked-list overhead, and you would know exactly
how big to make it, so it wouldn't have to hurt you on small queries.

2048 entries seems pretty darn generous actually.
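
A sketch of the vector flavor of that layout (illustrative types, not
backend code): the index-ordered vector keeps the TIDs exactly as the index
returned them, each entry of the physically sorted copy carries a
back-pointer into it, and the sizes come out to the 16K + 24K estimated
above.

#include <stdlib.h>

#define BATCH_SIZE 2048

typedef struct Tid
{
    unsigned    block;
    unsigned    offset;
} Tid;                          /* 8 bytes, as in the estimate above */

typedef struct FetchEntry
{
    Tid         tid;            /* copy, sorted into physical order */
    int         slot;           /* back-pointer into index_order[] */
} FetchEntry;                   /* 12 bytes */

typedef struct TidBatch
{
    Tid         index_order[BATCH_SIZE];    /* index order: 16K */
    FetchEntry  fetch_order[BATCH_SIZE];    /* heap-sweep order: 24K */
    int         ntids;
} TidBatch;

static int
fetch_cmp(const void *a, const void *b)
{
    const Tid  *ta = &((const FetchEntry *) a)->tid;
    const Tid  *tb = &((const FetchEntry *) b)->tid;

    if (ta->block != tb->block)
        return (ta->block < tb->block) ? -1 : 1;
    return (int) ta->offset - (int) tb->offset;
}

/* once a batch of TIDs has been read from the index, build the sorted copy */
static void
tidbatch_prepare(TidBatch *tb)
{
    for (int i = 0; i < tb->ntids; i++)
    {
        tb->fetch_order[i].tid = tb->index_order[i];
        tb->fetch_order[i].slot = i;
    }
    qsort(tb->fetch_order, tb->ntids, sizeof(FetchEntry), fetch_cmp);
}

int
main(void)
{
    TidBatch   *tb = calloc(1, sizeof(TidBatch));

    tb->index_order[0] = (Tid) {9731, 4};
    tb->index_order[1] = (Tid) {12, 7};
    tb->index_order[2] = (Tid) {403, 2};
    tb->ntids = 3;

    tidbatch_prepare(tb);
    /* fetch_order is now {12,7}(slot 1), {403,2}(slot 2), {9731,4}(slot 0):
     * each fetched tuple maps back to its index-order position in O(1). */
    free(tb);
    return 0;
}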

 How do you get to *return* the
 tuples in index order, without having read them all before you return

Hmm, the worst-case scenario would be to scan the heap in exactly the reverse
of index order.  With a 2048-entry batch size, you'd have a fair amount of
overhead on a LIMIT 1 :-)

Unless the index scanner could be given a maximum tuple count, rather than
just 'knowing' it from a GUC value.  Would this dirty up the plan tree
beyond all recognition?

 What happens when you run out of memory for the list of TIDs ... er,
 make that two lists of TIDs?

Now you really got me.  Hand waving: the same thing you do when you run out
of memory anywhere else :-)  By configuring the server to fetch index
entries in 1-entry batches, you'd be right back to the overhead (or very
nearly so) of the current implementation, so it would only hurt people who
chose to be hurt by it :-)


-Glen


Re: [HACKERS] Why frequently updated tables are an issue

2004-06-10 Thread Glen Parker
 It has been suggested in the past to add such a visibility to the index
 tuple header so
 that indexes and heaps can be cleaned out of order.  In such a case
 other background ...

It seems to me that the benefit of this wouldn't be all that impressive
*when accessing the cache*, which is the problem this discussion is about.
Disk access would occur more commonly with large tables, which I'll ignore.
Let's say total scan time for a query on a very dirty table is 100ms.  It
seems safe to assume that the scan time for the index would be *roughly*
half that of the heap.  If visibility could be determined by looking at just
the index tuple, you'd cut your query scan time down to 50ms.  When the clean
table case is 7ms total scan time, the difference between 50 and 100ms is
not much of an issue; either way, it's still way too high!

 However, increasing the index footprint seems to be a tough sell.


And rightly so, IMO.

Glen Parker



Re: [HACKERS] Point in Time Recovery

2004-07-15 Thread Glen Parker
 Simon Riggs wrote:
   On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
 
  Thanks for the vote of confidence, on or off list.
 
   too many people spend a lot of money for proprietary databases, just
   for some missing features in PostgreSQL
 
  Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
  already, even though it doesn't have it, then you are quite likely to be
  able to keep going without it.
 
  Most commercial users won't touch anything that doesn't have PITR.

 Agreed. I am surprised at how few requests we have gotten for PITR.  I
 assume people are either using replication or not considering us.

Don't forget that there are (must be) lots of us who know it's coming and
are just waiting until it's available.  I haven't requested it per se, but
believe me, I'm waiting for it :-)



Re: [HACKERS] Proposal for Implementing read-only queries during wal replay (SoC 2007)

2007-02-23 Thread Glen Parker

I'll throw in my vote; I would find this quite useful.

-Glen


Florian G. Pflug wrote:

I plan to submit a proposal for implementing support for
read-only queries during wal replay as a Google Summer of Code 2007
project.

I've been browsing the postgres source code for the last few days,
and came up with the following plan for an implementation.

I'd be very interested in any feedback on the proposal - especially
of the "you overlooked this and that, it can never work that way" kind ;-)


Re: [HACKERS] more than one index in a single heap pass?

2009-07-14 Thread Glen Parker

Andrew Dunstan wrote:
Well, yes, it's some of it, and in theory Tom's late addition of a queue 
that gets all the dependencies of a table as soon as the table data is 
restored should make that work better.  But of course, that's not the 
only time indexes are created, and each index creation command will be 
doing its own heap processing, albeit that synchronised scanning will 
make that lots cheaper.


As I said originally, it was just an idle thought that came to me today.



Sounds to me like another reason to separate index definition from 
creation.  If an index can be defined but not yet created or valid, then 
you could imagine syntax like:


DEFINE INDEX blahblah1 ON mytable (some fields);
DEFINE INDEX blahblah2 ON mytable (some other fields);
[RE]INDEX TABLE mytable;

...provided that REINDEX TABLE could recreate all indexes simultaneously 
as you suggest.
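
A toy sketch of why that could pay off (hypothetical structures, nothing
like the real catalog or executor code): once indexes can be defined
without being built, a single heap pass can feed every pending index,
instead of each CREATE INDEX doing its own scan.

#include <stdio.h>
#include <stddef.h>

typedef struct HeapTuple
{
    int         id;
} HeapTuple;

typedef struct PendingIndex
{
    const char *name;
    void      (*insert)(const struct PendingIndex *idx, const HeapTuple *tup);
} PendingIndex;

static void
toy_insert(const PendingIndex *idx, const HeapTuple *tup)
{
    printf("index %s gets tuple %d\n", idx->name, tup->id);
}

int
main(void)
{
    HeapTuple   heap[] = {{1}, {2}, {3}};   /* stand-in for the table */
    PendingIndex pending[] = {              /* defined but not yet built */
        {"blahblah1", toy_insert},
        {"blahblah2", toy_insert},
    };

    /* [RE]INDEX TABLE: a single heap pass feeds every pending index */
    for (size_t t = 0; t < sizeof(heap) / sizeof(heap[0]); t++)
        for (size_t i = 0; i < sizeof(pending) / sizeof(pending[0]); i++)
            pending[i].insert(&pending[i], &heap[t]);
    return 0;
}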


-Glen

