Re: [PATCHES] Seq scans status update

2007-06-02 Thread Luke Lonergan
Hi All,

On 5/31/07 12:40 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote:

 BTW, we've been talking about the L2 cache effect but we don't really
 know for sure if the effect has anything to do with the L2 cache. But
 whatever it is, it's real.

The mailing list archives contain ample evidence that:
- it's definitely an L2 cache effect
- on fast I/O hardware tests show large benefits of keeping the ring in L2

I see no reason to re-open the discussion about these; can we accept them
as fact and continue?

- Luke


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] Seq scans status update

2007-06-02 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 The mailing list archives contain ample evidence that:
 - it's definitely an L2 cache effect
 - on fast I/O hardware tests show large benefits of keeping the ring in L2

Please provide some specific pointers, because I don't remember that.

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-31 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
I just ran a quick test with 4 concurrent scans on a dual-core system, 
and it looks like we do leak buffers from the rings because they're 
pinned at the time they would be recycled.


Yeah, I noticed the same in some tests here.  I think there's not a lot
we can do about that; we don't have enough visibility into why someone
else has the buffer pinned.


We could stash pinned buffers to some other list etc. and try them again 
later. But that gets a lot more complex.



Using a larger ring would help, by making it less probable that any
other sync-scanning backend is so far behind as to still have the oldest
element of our ring pinned.  But if we do that we have the L2-cache-size
effect to worry about.  Is there any actual data backing up that it's
useful to keep the ring fitting in L2, or is that just guesswork?  In
the sync-scan case the idea seems pretty bogus anyway, because the
actual working set will be N backends' rings not just one.


Yes, I tested different ring sizes here: 
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00469.php


The tests above showed the effect when reading a table from OS cache. I 
haven't seen direct evidence supporting Luke's claim that the ring makes 
scans of tables bigger than RAM go faster with bigger I/O hardware, 
because I don't have such hardware at hand. We did, however, repeat the 
tests on different hardware and monitored the CPU usage with vmstat at 
the same time. The CPU usage was significantly lower with the patch, so 
I believe that with better I/O hardware the test would become limited by 
CPU and the patch would therefore make it go faster.


BTW, we've been talking about the L2 cache effect but we don't really 
know for sure if the effect has anything to do with the L2 cache. But 
whatever it is, it's real.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Seq scans status update

2007-05-30 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

A heapscan would pin the buffer only once and hence bump its count at
most once, so I don't see a big problem here.  Also, I'd argue that
buffers that had a positive usage_count shouldn't get sucked into the
ring to begin with.


True, except that with the synchronized scans patch two synchronized 
scans will pin the buffer twice.


Hmm.  But we probably don't want the same buffer in two different
backends' rings, either.  You *sure* the sync-scan patch has no
interaction with this one?


A buffer is only put in the ring when it's requested with 
StrategyGetBuffer. If a page is requested with ReadBuffer and it's 
already in cache, it won't be put in the ring. When multiple scanners are 
scanning synchronously, they all use their own, distinct rings when 
reading buffers into the cache, and peek into the buffers in other 
backends' rings for pages that others have read into the buffer cache.
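
A minimal sketch of that bookkeeping (names, the ring size, and the cache-hit shortcut are illustrative, not the patch's actual code): only the allocation path touches the ring; a plain cache hit never does.

```c
#include <assert.h>

/* Illustrative-only miniature of the bookkeeping described above: the ring
 * holds buffer ids and is filled only on the allocation path (the
 * StrategyGetBuffer case); a plain cache hit bypasses it entirely. */
#define RING_SIZE 8

typedef struct
{
	int		bufs[RING_SIZE];	/* buffer ids, -1 = empty slot */
	int		next;				/* next slot to recycle */
} BufferRing;

static void
ring_init(BufferRing *ring)
{
	for (int i = 0; i < RING_SIZE; i++)
		ring->bufs[i] = -1;
	ring->next = 0;
}

/* Only called when a buffer is actually allocated for the scan. */
static void
ring_remember(BufferRing *ring, int buf)
{
	ring->bufs[ring->next] = buf;
	ring->next = (ring->next + 1) % RING_SIZE;
}

/* "cached" stands in for ReadBuffer finding the page already resident,
 * e.g. because another synchronized scan read it first. */
static int
read_buffer(BufferRing *ring, int page, int cached)
{
	if (cached)
		return page;			/* cache hit: no ring bookkeeping */
	ring_remember(ring, page);	/* allocation path: remember in ring */
	return page;
}
```

So two synchronized scans mostly share pages via cache hits, and each ring only accumulates the pages that backend itself had to read in.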


I'm going to test the interactions in more detail...


One other question: I see the patch sets the threshold for switching
from normal to ring-buffer heapscans at table size = NBuffers.  Why
so high?  I'd have expected maybe at most NBuffers/4 or NBuffers/10.
If you don't want a seqscan blowing out your buffer cache, you surely
don't want it blowing out 90% of the cache either.


NBuffers is the maximum value that makes sense; if you're scanning more 
than NBuffers, the scan is definitely not going to fit in 
shared_buffers. Anything less than that and we might be causing harm to 
some use cases, so I chose that for the time being.


Simon argued for a GUC variable, and Jeff's patch as it stands 
introduces one. I'm not sure we want it, but if we do, we should use the 
same variable to control both the sync scan and cache replacement 
policy. It's essentially "how large a scan do you expect to fit in 
shared_buffers?"


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Seq scans status update

2007-05-30 Thread Jeff Davis
On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:
  Hmm.  But we probably don't want the same buffer in two different
  backends' rings, either.  You *sure* the sync-scan patch has no
  interaction with this one?
  
 
 I will run some tests again tonight, I think the interaction needs more
 testing than I did originally. Also, I'm not sure that the hardware I
 have is sufficient to test those cases.
 

I ran some sanity tests last night with cvs head, plus my
syncscan20-cvshead.patch, plus scan_recycle_buffers.v3.patch.

It passed the sanity tests at least.

I did see that there was more interference with sync_seqscan_threshold=0
(always on) and scan_recycle_buffers=0 (off) than I had previously seen
with 8.2.4, so I will test again against 8.2.4 to see why that might be.
The interference I saw was still quite small: a scan running
concurrently with 9 other scans was about 10% slower than a scan running
alone -- which is still very good compared with plain cvs head and no
sync scan -- it's just not ideal. 

However, varying scan_recycle_buffers among 0 (off), 16, 32, and 128
didn't have much effect. At 32 it appeared to be about 1% worse during
10 scans, but that may have been noise. The other values I tried didn't
make any difference that I could see.

This was really just a quick sanity test; I think more hard data would
be useful.

Regards,
Jeff Davis




Re: [PATCHES] Seq scans status update

2007-05-30 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 One other question: I see the patch sets the threshold for switching
 from normal to ring-buffer heapscans at table size = NBuffers.  Why
 so high?  I'd have expected maybe at most NBuffers/4 or NBuffers/10.
 If you don't want a seqscan blowing out your buffer cache, you surely
 don't want it blowing out 90% of the cache either.

 NBuffers is the maximum value that makes sense; if you're scanning more 
 than NBuffers, the scan is definitely not going to fit in 
 shared_buffers. Anything less than that and we might be causing harm to 
 some use cases, so I chose that for the time being.

But the flip side of that is you're failing to provide the benefit of
the patch in quite a lot of use-cases where it's clearly beneficial.
I just don't believe that there are very many cases where people will
want a heapscan to eat 90% of their cache.

 Simon argued for a GUC variable, and Jeff's patch as it stands 
 introduces one. I'm not sure we want it but if we do, we should use the 
 same variable to control both the sync scan and cache replacement 
 policy. It's essentially how large a scan do you expect to fit in 
 shared_buffers?

Well, let's do some experiments and see if there's really any point in
varying the cutover.

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-30 Thread Heikki Linnakangas

Jeff Davis wrote:

On Tue, 2007-05-29 at 17:43 -0700, Jeff Davis wrote:

Hmm.  But we probably don't want the same buffer in two different
backends' rings, either.  You *sure* the sync-scan patch has no
interaction with this one?


I will run some tests again tonight, I think the interaction needs more
testing than I did originally. Also, I'm not sure that the hardware I
have is sufficient to test those cases.



I ran some sanity tests last night with cvs head, plus my
syncscan20-cvshead.patch, plus scan_recycle_buffers.v3.patch.

It passed the sanity tests at least.

I did see that there was more interference with sync_seqscan_threshold=0
(always on) and scan_recycle_buffers=0 (off) than I had previously seen
with 8.2.4, so I will test again against 8.2.4 to see why that might be.
The interference that I saw was still quite small, a scan moving
concurrently with 9 other scans was about 10% slower than a scan running
alone -- which is still very good compared with plain cvs head and no
sync scan -- it's just not ideal. 


However, turning scan_recycle_buffers between 0 (off), 16, 32, and 128
didn't have much effect. At 32 it appeared to be about 1% worse during
10 scans, but that may have been noise. The other values I tried didn't
have any difference that I could see.

This was really just a quick sanity test, I think more hard data would
be useful.


The interesting question is whether the small buffer ring is big enough 
to let all synchronized scans process a page before it's recycled. Keep 
an eye on pg_buffercache to see if it gets filled with pages from the 
table you're querying.


I just ran a quick test with 4 concurrent scans on a dual-core system, 
and it looks like we do leak buffers from the rings because they're 
pinned at the time they would be recycled. A full scan of a 30GB table 
took just under 7 minutes, and starting after a postmaster restart it 
took ~4-5 minutes until all of the 320MB of shared_buffers were used. 
That means we're leaking a buffer from the ring very roughly on every 
20-40th ReadBuffer call, but I'll put in some proper instrumentation and 
test with different configurations to get a better picture.


The synchronized scans patch gives such a big benefit when it's 
applicable that I think some cache-spoiling is acceptable and in fact 
unavoidable in some scenarios. It's much better than the 8.2 behavior anyway.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Seq scans status update

2007-05-30 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I just ran a quick test with 4 concurrent scans on a dual-core system, 
 and it looks like we do leak buffers from the rings because they're 
 pinned at the time they would be recycled.

Yeah, I noticed the same in some tests here.  I think there's not a lot
we can do about that; we don't have enough visibility into why someone
else has the buffer pinned.

Using a larger ring would help, by making it less probable that any
other sync-scanning backend is so far behind as to still have the oldest
element of our ring pinned.  But if we do that we have the L2-cache-size
effect to worry about.  Is there any actual data backing up that it's
useful to keep the ring fitting in L2, or is that just guesswork?  In
the sync-scan case the idea seems pretty bogus anyway, because the
actual working set will be N backends' rings not just one.

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-30 Thread Jeff Davis
On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
 In
 the sync-scan case the idea seems pretty bogus anyway, because the
 actual working set will be N backends' rings not just one.

I don't follow. Ideally, in the sync-scan case, the sets of buffers in
the ring of different scans on the same relation will overlap
completely, right?

We might not be at the ideal, but the sets of buffers in the rings
shouldn't be disjoint, should they?

Regards,
Jeff Davis




Re: [PATCHES] Seq scans status update

2007-05-30 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes:
 On Wed, 2007-05-30 at 15:56 -0400, Tom Lane wrote:
 In the sync-scan case the idea seems pretty bogus anyway, because the
 actual working set will be N backends' rings not just one.

 I don't follow. Ideally, in the sync-scan case, the sets of buffers in
 the ring of different scans on the same relation will overlap
 completely, right?

 We might not be at the ideal, but the sets of buffers in the rings
 shouldn't be disjoint, should they?

According to Heikki's explanation here
http://archives.postgresql.org/pgsql-patches/2007-05/msg00498.php
each backend doing a heapscan would collect its own ring of buffers.
You might have a few backends that are always followers, never leaders,
and so never actually fetch any pages --- but for each backend that
actually did any I/O there would be a separate ring.  In practice I'd
expect the lead would change hands pretty often and so you'd see all
the backends accumulating their own rings.

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-29 Thread Alvaro Herrera
Tom Lane wrote:
 Gregory Stark [EMAIL PROTECTED] writes:
  Is there a reason UnpinBuffer has to be the one to increment the usage count
  anyways? Why can't ReadBuffer handle incrementing the count and just trust
  that it won't be decremented until the buffer is unpinned anyways?
 
 That's a good question.  I think the idea was that if we hold a buffer
 pinned for awhile (long enough that the bgwriter's clock sweep passes
 over it one or more times), we want the usage count decrementing to
 start when we release the pin, not when we acquire it.  But maybe that
 could be fixed if the clock sweep doesn't touch the usage_count of a
 pinned buffer.  Which in fact it may not do already --- didn't look.

It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with
skip_pinned=false.  The "lru" scan does skip pinned buffers.

-- 
Alvaro Herrera  Developer, http://www.PostgreSQL.org/
World domination is proceeding according to plan (Andrew Morton)



Re: [PATCHES] Seq scans status update

2007-05-29 Thread Heikki Linnakangas

Alvaro Herrera wrote:

Tom Lane wrote:

Gregory Stark [EMAIL PROTECTED] writes:

Is there a reason UnpinBuffer has to be the one to increment the usage count
anyways? Why can't ReadBuffer handle incrementing the count and just trust
that it won't be decremented until the buffer is unpinned anyways?

That's a good question.  I think the idea was that if we hold a buffer
pinned for awhile (long enough that the bgwriter's clock sweep passes
over it one or more times), we want the usage count decrementing to
start when we release the pin, not when we acquire it.  But maybe that
could be fixed if the clock sweep doesn't touch the usage_count of a
pinned buffer.  Which in fact it may not do already --- didn't look.


It does -- in BgBufferSync the "all" scan calls SyncOneBuffer with
skip_pinned=false.  The "lru" scan does skip pinned buffers.


You're looking in the wrong place. StrategyGetBuffer drives the clock 
sweep, and it always decreases the usage_count; IOW it doesn't skip 
pinned buffers. SyncOneBuffer and BgBufferSync don't decrease the 
usage_count in any case.
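
To make the distinction concrete, here's a toy clock sweep in the shape described: every pass decrements usage_count, pinned or not (as StrategyGetBuffer did at the time), and a buffer is handed out only when it is unpinned with a zero count. All names and sizes are illustrative, not the real bufmgr code.

```c
#include <assert.h>

/* Toy clock sweep: each pass over a buffer decrements usage_count even
 * when the buffer is pinned; a victim must be unpinned with count 0. */
#define NBUF 4

typedef struct
{
	int		refcount;		/* pin count */
	int		usage_count;
} BufDesc;

static int	clock_hand = 0;

static int
clock_sweep(BufDesc *bufs)
{
	for (;;)
	{
		BufDesc    *buf = &bufs[clock_hand];

		clock_hand = (clock_hand + 1) % NBUF;

		if (buf->usage_count > 0)
			buf->usage_count--;			/* decremented even if pinned */
		else if (buf->refcount == 0)
			return (int) (buf - bufs);	/* unpinned, count 0: victim */
		/* else: pinned with zero count, skip and try the next buffer */
	}
}
```

Tom's suggested variant would simply add a `refcount == 0` test before the decrement, so a long-held pin doesn't see its count drained while it waits.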


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Seq scans status update

2007-05-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Here's a new version, all known issues are now fixed. I'm now happy with 
 this patch.

I'm looking this over and finding it fairly ugly from a
system-structural point of view.  In particular, this has pushed the
single-global-variable StrategyHintVacuum() concept well past the
breaking point.  That API was sort of tolerable for vacuum, since vacuum
isn't re-entrant and can't coexist with any other behavior within the
same backend; though I never liked the fact that it affected vacuum's
system-catalog accesses along with the intended table fetches.  But now
you've got bits of code hacking the access pattern in contexts that are
far less predictable than vacuum ... and not guaranteeing to reset the
access pattern on failure, either.

I think we've got to get rid of the global variable and make the access
pattern be a parameter to StrategyGetBuffer, instead.  Which in turn
suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the
next level up.  This would serve directly for the heapscan case, and
for the main heap accesses in vacuum/analyze, but it might be a bit
of a pain to make index vacuuming pay attention to the strategy.
The other case that the patch addresses is COPY TO REL, which we could
handle if we were willing to pass a strategy parameter down through
heap_insert and RelationGetBufferForTuple ... is it worth the trouble?
I don't recall anyone having presented measurements to show COPY TO REL
being helped by this patch.

I don't especially care for the ring data structure being global inside
freelist.c, either.  What I'm inclined to do is have the access
strategy parameter actually be a pointer to some struct type, with NULL
representing the default strategy.  At the beginning of a bulk operation
we'd create an access strategy object, which would be internally the
current Ring data structure, but it'd be opaque to code outside
freelist.c.  This'd mean that a query using two large seqscans would use
two rings not one, but that's not necessarily bad; it would in any case
make it a lot easier to generalize the strategy concept in future to
support more than one kind of non-default policy being active in
different parts of a query.

A point I have not figured out how to deal with is that in the patch as
given, UnpinBuffer needs to know the strategy; and getting it that info
would make the changes far more invasive.  But the patch's behavior here
is pretty risky anyway, since the strategy global at the time of unpin
might have nothing to do with how it was set when the buffer was
acquired.  What I'm tempted to do is remove the special case there and
adjust buffer acquisition instead (maybe make it decrement the
usage_count when re-using a buffer from the ring).
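
A rough sketch of the opaque-strategy idea described above, under hypothetical names (the API that eventually got committed may differ): callers outside freelist.c hold only an opaque pointer, NULL means the default policy, and the ring lives inside the struct.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch only: code outside "freelist.c" sees an opaque pointer; the
 * ring data structure is private to this translation unit. */
typedef struct AccessStrategyData *AccessStrategy;

struct AccessStrategyData		/* definition private to freelist.c */
{
	int		nbuffers;			/* ring size */
	int		next;				/* next ring slot to try */
	int		ring[];				/* buffer ids, -1 = not yet filled */
};

static AccessStrategy
GetAccessStrategy(int nbuffers)
{
	AccessStrategy s = calloc(1, sizeof(*s) + nbuffers * sizeof(int));

	s->nbuffers = nbuffers;
	for (int i = 0; i < nbuffers; i++)
		s->ring[i] = -1;
	return s;
}

/* NULL strategy = default replacement policy, no ring at all. A query
 * with two large seqscans would simply create two independent objects. */
static int
StrategyUsesRing(AccessStrategy s)
{
	return s != NULL;
}
```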

Comments?  I'm still playing with the ideas at this point...

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 A point I have not figured out how to deal with is that in the patch as
 given, UnpinBuffer needs to know the strategy; and getting it that info
 would make the changes far more invasive.  But the patch's behavior here
 is pretty risky anyway, since the strategy global at the time of unpin
 might have nothing to do with how it was set when the buffer was
 acquired.

 It's assumed that the same strategy is used when unpinning, which is 
 true for the current usage (and apparently needs to be documented).

I don't believe that for a moment.  Even in the trivial heapscan case,
the last pin is typically dropped during a tupleslot clear operation
sometime after the scan itself has moved on to the next page.  In a more
general context (such as parallel heapscan and indexscan on a rel) it's
certainly too risky to assume it.

 Normally buffers that are in the ring are recycled quickly, in which 
case the usage_count will always be increased from 0 to 1 in UnpinBuffer, 
 regardless of the check. The check is there to avoid inflating the 
 usage_count on buffers that happened to be already in shared_buffers.

A heapscan would pin the buffer only once and hence bump its count at
most once, so I don't see a big problem here.  Also, I'd argue that
buffers that had a positive usage_count shouldn't get sucked into the
ring to begin with.

 If we figure out another way to deal with the COPY usage_count, maybe we 
 could remove the check altogether. I've been thinking of changing COPY 
 anyway so that it always keeps the last page it inserted into pinned, to 
 avoid the traffic to the buffer manager on each tuple.

This is getting fairly invasive for a part of the patch you've not
justified at all yet...

 Let me know if you want me to make changes and submit a new version.

I can work on this stuff; please do the tests to show whether it's worth
handling COPY TO REL, and keep on with Jeff's patch.

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-28 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
It's assumed that the same strategy is used when unpinning, which is 
true for the current usage (and apparently needs to be documented).


I don't believe that for a moment.  Even in the trivial heapscan case,
the last pin is typically dropped during a tupleslot clear operation
sometime after the scan itself has moved on to the next page.  In a more
general context (such as parallel heapscan and indexscan on a rel) it's
certainly too risky to assume it.


Hmm, I guess you're right. :(

One idea is to keep track of which pins are taken using the bulk strategy. 
It's a bit tricky when a buffer is pinned multiple times, since we don't 
know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we 
could get away with just a flag per pinned buffer. Set the flag when a 
buffer is pinned with the bulk strategy and it wasn't pinned by us before, 
and clear it when it's pinned with another strategy. I'm thinking we 
steal one bit from PrivateRefCount for this.
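
For illustration, stealing one bit from a 32-bit private refcount for a "pinned via bulk strategy" flag could look roughly like this; the macros and the exact set/clear rules are hypothetical, following the sketch above.

```c
#include <assert.h>
#include <stdint.h>

/* Top bit of the per-buffer private refcount marks "pinned via the bulk
 * strategy"; the remaining 31 bits hold the actual pin count. */
#define BULK_FLAG		((uint32_t) 1 << 31)
#define REFCOUNT_MASK	(~BULK_FLAG)

static uint32_t
get_refcount(uint32_t v)
{
	return v & REFCOUNT_MASK;
}

static int
is_bulk(uint32_t v)
{
	return (v & BULK_FLAG) != 0;
}

/* Take one more pin: set the flag on the first pin taken with the bulk
 * strategy, clear it whenever the buffer is pinned with another one. */
static uint32_t
pin(uint32_t v, int bulk)
{
	uint32_t	cnt = get_refcount(v) + 1;

	if (!bulk)
		return cnt;					/* non-bulk pin clears the flag */
	if (get_refcount(v) == 0)
		return cnt | BULK_FLAG;		/* first pin, bulk: set flag */
	return cnt | (v & BULK_FLAG);	/* already pinned: keep flag as-is */
}
```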


Normally buffers that are in the ring are recycled quickly, in which 
case the usage_count will always be increased from 0 to 1 in UnpinBuffer, 
regardless of the check. The check is there to avoid inflating the 
usage_count on buffers that happened to be already in shared_buffers.


A heapscan would pin the buffer only once and hence bump its count at
most once, so I don't see a big problem here.  Also, I'd argue that
buffers that had a positive usage_count shouldn't get sucked into the
ring to begin with.


True, except that with the synchronized scans patch two synchronized 
scans will pin the buffer twice.


If we figure out another way to deal with the COPY usage_count, maybe we 
could remove the check altogether. I've been thinking of changing COPY 
anyway so that it always keeps the last page it inserted into pinned, to 
avoid the traffic to the buffer manager on each tuple.


This is getting fairly invasive for a part of the patch you've not
justified at all yet...


Let me know if you want me to make changes and submit a new version.


I can work on this stuff; please do the tests to show whether it's worth
handling COPY TO REL, and keep on with Jeff's patch.


Ok.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Seq scans status update

2007-05-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 One idea is to keep track of which pins are taken using the bulk strategy. 
 It's a bit tricky when a buffer is pinned multiple times, since we don't 
 know which ReleaseBuffer corresponds to which ReadBuffer, but perhaps we 
 could get away with just a flag per pinned buffer. Set the flag when a 
 buffer is pinned with the bulk strategy and it wasn't pinned by us before, 
 and clear it when it's pinned with another strategy. I'm thinking we 
 steal one bit from PrivateRefCount for this.

Seems like a mess.  Why don't we just fix it so there's no need for
different behavior at Unpin time?  The facts on the ground are that
the current patch's change in UnpinBuffer is a no-op anyway, because
of the tupletableslot interference.

The behavior I'm imagining is just that when we try to take a buffer
from the ring, if its usage count exceeds 1 then drop it from the ring
and get another buffer.  1 would be the expected case if no one had
touched it since we last used it.
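
A toy version of that rule, with illustrative names only: reuse the ring slot's buffer only if its usage_count is at most 1 (the count we ourselves would have left behind); otherwise drop it from the ring and signal the caller to allocate a fresh buffer.

```c
#include <assert.h>

#define RING_SLOTS 4

typedef struct
{
	int		usage_count;
} Buf;

/* Returns the buffer id to reuse, or -1 for "take a new buffer from the
 * general pool and put it in this slot". */
static int
ring_get(int *ring, int slot, Buf *bufs)
{
	int		buf = ring[slot];

	if (buf >= 0 && bufs[buf].usage_count <= 1)
		return buf;				/* untouched since we last used it */

	ring[slot] = -1;			/* someone else touched it: drop it */
	return -1;
}
```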

 A heapscan would pin the buffer only once and hence bump its count at
 most once, so I don't see a big problem here.  Also, I'd argue that
 buffers that had a positive usage_count shouldn't get sucked into the
 ring to begin with.

 True, except that with the synchronized scans patch two synchronized 
 scans will pin the buffer twice.

Hmm.  But we probably don't want the same buffer in two different
backends' rings, either.  You *sure* the sync-scan patch has no
interaction with this one?

One other question: I see the patch sets the threshold for switching
from normal to ring-buffer heapscans at table size = NBuffers.  Why
so high?  I'd have expected maybe at most NBuffers/4 or NBuffers/10.
If you don't want a seqscan blowing out your buffer cache, you surely
don't want it blowing out 90% of the cache either.

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-28 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 A point I have not figured out how to deal with is that in the patch as
 given, UnpinBuffer needs to know the strategy; and getting it that info
 would make the changes far more invasive.  But the patch's behavior here
 is pretty risky anyway, since the strategy global at the time of unpin
 might have nothing to do with how it was set when the buffer was
 acquired.  What I'm tempted to do is remove the special case there and
 adjust buffer acquisition instead (maybe make it decrement the
 usage_count when re-using a buffer from the ring).

Is there a reason UnpinBuffer has to be the one to increment the usage count
anyways? Why can't ReadBuffer handle incrementing the count and just trust
that it won't be decremented until the buffer is unpinned anyways?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Seq scans status update

2007-05-28 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Is there a reason UnpinBuffer has to be the one to increment the usage count
 anyways? Why can't ReadBuffer handle incrementing the count and just trust
 that it won't be decremented until the buffer is unpinned anyways?

That's a good question.  I think the idea was that if we hold a buffer
pinned for awhile (long enough that the bgwriter's clock sweep passes
over it one or more times), we want the usage count decrementing to
start when we release the pin, not when we acquire it.  But maybe that
could be fixed if the clock sweep doesn't touch the usage_count of a
pinned buffer.  Which in fact it may not do already --- didn't look.

regards, tom lane



Re: [PATCHES] Seq scans status update

2007-05-26 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Here's a new version, all known issues are now fixed. I'm now happy with 
this patch.
Next, I'll start looking at the latest version of Jeff's synchronized 
scans patch.


I'm a bit confused --- weren't you intending to review these in parallel
because of the possible interactions?  Do you think this should be
applied now, or does it need to wait to see what happens with Jeff's
patch?


I think it should be applied now. I've briefly looked at Jeff's patch 
and I don't see any problems looming there.


Jeff performed tests with Simon's original patch and his patch, and I 
think the results from those tests are still valid since the basic 
behavior hasn't been changed. I'll repeat those tests myself, and run 
some more to see if the added CPU overhead shows up in tests, but I'm 
pretty confident the patches will work together as advertised.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Seq scans status update

2007-05-25 Thread Heikki Linnakangas
Here's a new version; all known issues are now fixed. I'm now happy with 
this patch.


Next, I'll start looking at the latest version of Jeff's synchronized 
scans patch.


Bruce Momjian wrote:

Great.  Based on this, do you have a patch that is ready to apply?

---

Heikki Linnakangas wrote:

Heikki Linnakangas wrote:
In any case, I'd like to see more test results before we make a 
decision. I'm running tests with DBT-2 and a seq scan running in the 
background to see if the cache-spoiling effect shows up. I'm also trying 
to get hold of some bigger hardware to run on. Running these tests takes 
some calendar time, but the hard work has already been done. I'm going 
to start reviewing Jeff's synchronized scans patch now.

Here are the results of the DBT-2 tests:

http://community.enterprisedb.com/seqscan/imola/

In each of these tests, at the end of rampup a script is started that 
issues a SELECT COUNT(*) FROM stock in a loop, with a 2-minute delay 
between the end of the previous query and the start of the next one.


The patch makes the seq scans go significantly faster. In the 1-hour 
test period, the patched tests perform roughly 30-100% more selects 
than the unpatched tests.


With 100 and 105 warehouses, it also significantly reduces the impact of 
the seq scan on other queries; response times are lower with the patch. 
With 120 warehouses the reduction of impact is not as clear, but when 
you plot the response times it's still there (the plots on the response 
times charts-page are useless because they're overwhelmed by the 
checkpoint spike).


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com






--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/heap/heapam.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.232
diff -c -r1.232 heapam.c
*** src/backend/access/heap/heapam.c	8 Apr 2007 01:26:27 -	1.232
--- src/backend/access/heap/heapam.c	25 May 2007 19:22:33 -
***
*** 83,88 
--- 83,96 
  	 */
  	scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
  
+ 	/* A scan on a table smaller than shared_buffers is treated like random
+ 	 * access, but bigger scans use the bulk read page replacement policy.
+ 	 */
+ 	if (scan-rs_nblocks  NBuffers)
+ 		scan-rs_accesspattern = AP_BULKREAD;
+ 	else
+ 		scan-rs_accesspattern = AP_NORMAL;
+ 
  	scan-rs_inited = false;
  	scan-rs_ctup.t_data = NULL;
  	ItemPointerSetInvalid(scan-rs_ctup.t_self);
***
*** 123,133 
--- 131,146 
  
  	Assert(page < scan->rs_nblocks);
  
+ 	/* Read the page with the right strategy */
+ 	SetAccessPattern(scan->rs_accesspattern);
+ 
  	scan->rs_cbuf = ReleaseAndReadBuffer(scan->rs_cbuf,
  										 scan->rs_rd,
  										 page);
  	scan->rs_cblock = page;
  
+ 	SetAccessPattern(AP_NORMAL);
+ 
  	if (!scan->rs_pageatatime)
  		return;
  
Index: src/backend/access/transam/xlog.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.268
diff -c -r1.268 xlog.c
*** src/backend/access/transam/xlog.c	30 Apr 2007 21:01:52 -	1.268
--- src/backend/access/transam/xlog.c	15 May 2007 16:23:30 -
***
*** 1668,1673 
--- 1668,1700 
  }
  
  /*
+  * Returns true if 'record' hasn't been flushed to disk yet.
+  */
+ bool
+ XLogNeedsFlush(XLogRecPtr record)
+ {
+ 	/* Quick exit if already known flushed */
+ 	if (XLByteLE(record, LogwrtResult.Flush))
+ 		return false;
+ 
+ 	/* read LogwrtResult and update local state */
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		LogwrtResult = xlogctl->LogwrtResult;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 	}
+ 
+ 	/* check again */
+ 	if (XLByteLE(record, LogwrtResult.Flush))
+ 		return false;
+ 
+ 	return true;
+ }
+ 
+ /*
   * Ensure that all XLOG data through the given position is flushed to disk.
   *
   * NOTE: this differs from XLogWrite mainly in that the WALWriteLock is not
Index: src/backend/commands/copy.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.283
diff -c -r1.283 copy.c
*** src/backend/commands/copy.c	27 Apr 2007 22:05:46 -	1.283
--- src/backend/commands/copy.c	15 May 2007 17:05:29 -
***
*** 1876,1881 
--- 1876,1888 
  	nfields = file_has_oids ? (attr_count + 1) : attr_count;
  	field_strings = (char **) palloc(nfields * sizeof(char *));
  
+ 	/* Use the special COPY buffer replacement 

Re: [PATCHES] Seq scans status update

2007-05-25 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Here's a new version, all known issues are now fixed. I'm now happy with 
 this patch.
 Next, I'll start looking at the latest version of Jeff's synchronized 
 scans patch.

I'm a bit confused --- weren't you intending to review these in parallel
because of the possible interactions?  Do you think this should be
applied now, or does it need to wait to see what happens with Jeff's
patch?

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PATCHES] Seq scans status update

2007-05-22 Thread Bruce Momjian

Great.  Based on this, do you have a patch that is ready to apply?

---

Heikki Linnakangas wrote:
 Heikki Linnakangas wrote:
  In any case, I'd like to see more test results before we make a 
  decision. I'm running tests with DBT-2 and a seq scan running in the 
  background to see if the cache-spoiling effect shows up. I'm also trying 
  to get hold of some bigger hardware to run on. Running these tests takes 
  some calendar time, but the hard work has already been done. I'm going 
  to start reviewing Jeff's synchronized scans patch now.
 
 Here are the results of the DBT-2 tests:
 
 http://community.enterprisedb.com/seqscan/imola/
 
 In each of these tests, at the end of rampup a script is started that 
 issues a SELECT COUNT(*) FROM stock in a loop, with 2 minute delay 
 between end of previous query and start of next one.
 
 The patch makes the seq scans go significantly faster. In the 1 hour 
 test period, the patched tests perform roughly 30-100% more selects 
 than unpatched tests.
 
 With 100 and 105 warehouses, it also significantly reduces the impact of 
 the seq scan on other queries; response times are lower with the patch. 
 With 120 warehouses the reduction of impact is not as clear, but when 
 you plot the response times it's still there (the plots on the response 
 times charts-page are useless because they're overwhelmed by the 
 checkpoint spike).
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PATCHES] Seq scans status update

2007-05-22 Thread Heikki Linnakangas

Not yet, there's still one issue that needs fixing.

Bruce Momjian wrote:

Great.  Based on this, do you have a patch that is ready to apply?

---

Heikki Linnakangas wrote:

Heikki Linnakangas wrote:
In any case, I'd like to see more test results before we make a 
decision. I'm running tests with DBT-2 and a seq scan running in the 
background to see if the cache-spoiling effect shows up. I'm also trying 
to get hold of some bigger hardware to run on. Running these tests takes 
some calendar time, but the hard work has already been done. I'm going 
to start reviewing Jeff's synchronized scans patch now.

Here are the results of the DBT-2 tests:

http://community.enterprisedb.com/seqscan/imola/

In each of these tests, at the end of rampup a script is started that 
issues a SELECT COUNT(*) FROM stock in a loop, with 2 minute delay 
between end of previous query and start of next one.


The patch makes the seq scans go significantly faster. In the 1 hour 
test period, the patched tests perform roughly 30-100% more selects 
than unpatched tests.


With 100 and 105 warehouses, it also significantly reduces the impact of 
the seq scan on other queries; response times are lower with the patch. 
With 120 warehouses the reduction of impact is not as clear, but when 
you plot the response times it's still there (the plots on the response 
times charts-page are useless because they're overwhelmed by the 
checkpoint spike).


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings





--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Seq scans status update

2007-05-21 Thread Heikki Linnakangas

I forgot to attach the program used to generate test data. Here it is.

Heikki Linnakangas wrote:
Attached is a new version of Simon's scan-resistant buffer manager 
patch. It's not ready for committing yet because of a small issue I 
found this morning (* see bottom), but here's a status update.


To recap, the basic idea is to use a small ring of buffers for large 
scans like VACUUM, COPY and seq-scans. Changes to the original patch:


- different sized rings are used for VACUUM, seq-scans, and COPY: VACUUM 
and seq-scans use a ring of 32 buffers, and COPY uses a ring of 4096 
buffers in the default configuration. See the README changes in the 
patch for the rationale.


- for queries with large seqscans, the buffer ring is only used for 
reads issued by the seq scan, not for any other reads in the query. 
Typical scenario where this matters is doing a large seq scan with a 
nested loop join to a smaller table. You don't want to use the buffer 
ring for index lookups inside the nested loop.


- for seqscans, drop buffers from the ring that would need a WAL flush 
to reuse. That makes bulk updates behave roughly like they do without 
the patch, instead of having to do a WAL flush every 32 pages.


I've spent a lot of time thinking of solutions to the last point. The 
obvious solution would be to not use the buffer ring for updating scans. 
The difficulty with that is that we don't know if a scan is read-only in 
heapam.c, where the hint to use the buffer ring is set.


I've completed a set of performance tests on a test server. The server 
has 4 GB of RAM, of which 1 GB is used for shared_buffers.


Results for a 10 GB table:

 head-copy-bigtable   | 00:10:09.07016
 head-copy-bigtable   | 00:10:20.507357
 head-copy-bigtable   | 00:10:21.857677
 head-copy_nowal-bigtable | 00:05:18.232956
 head-copy_nowal-bigtable | 00:03:24.109047
 head-copy_nowal-bigtable | 00:05:31.019643
 head-select-bigtable | 00:03:47.102731
 head-select-bigtable | 00:01:08.314719
 head-select-bigtable | 00:01:08.238509
 head-select-bigtable | 00:01:08.208563
 head-select-bigtable | 00:01:08.28347
 head-select-bigtable | 00:01:08.308671
 head-vacuum_clean-bigtable   | 00:01:04.227832
 head-vacuum_clean-bigtable   | 00:01:04.232258
 head-vacuum_clean-bigtable   | 00:01:04.294621
 head-vacuum_clean-bigtable   | 00:01:04.280677
 head-vacuum_hintbits-bigtable| 00:04:01.123924
 head-vacuum_hintbits-bigtable| 00:03:58.253175
 head-vacuum_hintbits-bigtable| 00:04:26.318159
 head-vacuum_hintbits-bigtable| 00:04:37.512965
 patched-copy-bigtable| 00:09:52.776754
 patched-copy-bigtable| 00:10:18.185826
 patched-copy-bigtable| 00:10:16.975482
 patched-copy_nowal-bigtable  | 00:03:14.882366
 patched-copy_nowal-bigtable  | 00:04:01.04648
 patched-copy_nowal-bigtable  | 00:03:56.062272
 patched-select-bigtable  | 00:03:47.704154
 patched-select-bigtable  | 00:01:08.460326
 patched-select-bigtable  | 00:01:10.441544
 patched-select-bigtable  | 00:01:11.916221
 patched-select-bigtable  | 00:01:13.848038
 patched-select-bigtable  | 00:01:10.956133
 patched-vacuum_clean-bigtable| 00:01:10.315439
 patched-vacuum_clean-bigtable| 00:01:12.210537
 patched-vacuum_clean-bigtable| 00:01:15.202114
 patched-vacuum_clean-bigtable| 00:01:10.712235
 patched-vacuum_hintbits-bigtable | 00:03:42.279201
 patched-vacuum_hintbits-bigtable | 00:04:02.057778
 patched-vacuum_hintbits-bigtable | 00:04:26.805822
 patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect, as expected. The 
select times did go up by a couple of seconds, which I didn't expect, 
though. One theory is that unused shared_buffers are swapped out during 
the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and 
try again at some point.


Results for a 2 GB table:

 copy-medsize-unpatched| 00:02:18.23246
 copy-medsize-unpatched| 00:02:22.347194
 copy-medsize-unpatched| 00:02:23.875874
 copy_nowal-medsize-unpatched  | 00:01:27.606334
 copy_nowal-medsize-unpatched  | 00:01:17.491243
 copy_nowal-medsize-unpatched  | 00:01:31.902719
 select-medsize-unpatched  | 00:00:03.786031
 select-medsize-unpatched  | 00:00:02.678069
 select-medsize-unpatched  | 00:00:02.666103
 select-medsize-unpatched  | 00:00:02.673494
 select-medsize-unpatched  | 00:00:02.669645
 select-medsize-unpatched  | 00:00:02.666278
 vacuum_clean-medsize-unpatched| 00:00:01.091356
 vacuum_clean-medsize-unpatched| 00:00:01.923138
 vacuum_clean-medsize-unpatched| 00:00:01.917213
 vacuum_clean-medsize-unpatched| 00:00:01.917333
 vacuum_hintbits-medsize-unpatched | 00:00:01.683718
 

Re: [PATCHES] Seq scans status update

2007-05-20 Thread Heikki Linnakangas

Heikki Linnakangas wrote:
In any case, I'd like to see more test results before we make a 
decision. I'm running tests with DBT-2 and a seq scan running in the 
background to see if the cache-spoiling effect shows up. I'm also trying 
to get hold of some bigger hardware to run on. Running these tests takes 
some calendar time, but the hard work has already been done. I'm going 
to start reviewing Jeff's synchronized scans patch now.


Here are the results of the DBT-2 tests:

http://community.enterprisedb.com/seqscan/imola/

In each of these tests, at the end of rampup a script is started that 
issues a SELECT COUNT(*) FROM stock in a loop, with 2 minute delay 
between end of previous query and start of next one.


The patch makes the seq scans go significantly faster. In the 1 hour 
test period, the patched tests perform roughly 30-100% more selects 
than unpatched tests.


With 100 and 105 warehouses, it also significantly reduces the impact of 
the seq scan on other queries; response times are lower with the patch. 
With 120 warehouses the reduction of impact is not as clear, but when 
you plot the response times it's still there (the plots on the response 
times charts-page are useless because they're overwhelmed by the 
checkpoint spike).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Seq scans status update

2007-05-18 Thread Heikki Linnakangas

CK Tan wrote:
If it is convenient for you, could you run my patch against the same 
hardware and data to get some numbers on select for comparison? Although 
we don't address updates, copy, or inserts, we are definitely getting 
at least a 20% improvement in scans here without poisoning the bufpool 
for large tables.


Sure, here you are:

 copy-cktan-big| 00:10:08.843126
 copy-cktan-big| 00:10:17.767606
 copy-cktan-big| 00:09:37.797059
 copy_nowal-cktan-big  | 00:05:12.305025
 copy_nowal-cktan-big  | 00:05:10.518417
 copy_nowal-cktan-big  | 00:05:03.472939
 select-cktan-big  | 00:03:27.655982
 select-cktan-big  | 00:01:55.496408
 select-cktan-big  | 00:01:31.693856
 select-cktan-big  | 00:01:12.705268
 select-cktan-big  | 00:01:12.478247
 select-cktan-big  | 00:01:10.866484
 vacuum_clean-cktan-big| 00:03:05.340875
 vacuum_clean-cktan-big| 00:01:12.428317
 vacuum_clean-cktan-big| 00:01:13.179957
 vacuum_clean-cktan-big| 00:01:10.43
 vacuum_hintbits-cktan-big | 00:03:58.78208
 vacuum_hintbits-cktan-big | 00:04:02.515778
 vacuum_hintbits-cktan-big | 00:04:19.083402
 vacuum_hintbits-cktan-big | 00:04:11.170831
 copy-cktan-med| 00:02:19.413484
 copy-cktan-med| 00:02:22.270497
 copy-cktan-med| 00:02:22.297946
 copy_nowal-cktan-med  | 00:01:31.192601
 copy_nowal-cktan-med  | 00:01:17.736356
 copy_nowal-cktan-med  | 00:01:32.272778
 select-cktan-med  | 00:00:03.774974
 select-cktan-med  | 00:00:01.279276
 select-cktan-med  | 00:00:01.297703
 select-cktan-med  | 00:00:01.304129
 select-cktan-med  | 00:00:01.297048
 select-cktan-med  | 00:00:01.306073
 vacuum_clean-cktan-med| 00:00:01.820733
 vacuum_clean-cktan-med| 00:00:01.755684
 vacuum_clean-cktan-med| 00:00:01.755659
 vacuum_clean-cktan-med| 00:00:01.750972
 vacuum_hintbits-cktan-med | 00:00:01.58744
 vacuum_hintbits-cktan-med | 00:00:06.216038
 vacuum_hintbits-cktan-med | 00:00:02.201789
 vacuum_hintbits-cktan-med | 00:00:36.494427

Select performance looks the same as with Simon's/my patch.

That 20% gain must be hardware-dependent. My interpretation is that the 
CPU savings from more efficient cache usage only matter when the I/O 
bandwidth is high enough that the CPU becomes the bottleneck.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Seq scans status update

2007-05-18 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
In any case, I do want this for VACUUMs to fix the "WAL flush for every 
dirty page" problem. Maybe we should indeed drop the other aspects of 
the patch and move on; I'm getting tired of this as well.


Can we devise a small patch that fixes that issue without creating
uncertain impacts on other behavior?


Yeah certainly, that's my backup plan.


The thing that has made me uncomfortable with this set of patches right
along is the presumption that it's a good idea to tweak cache behavior
to improve a small set of cases.  We are using cache behavior (now
clock-sweep, formerly LRU) that is well tested and proven across a large
body of experience; my gut feeling is that deviating from that behavior
is likely to be a net loss when the big picture is considered.  I
certainly have not seen any test results that try to prove that there's
no such generalized disadvantage to these patches.


That was my initial reaction as well. However, people claimed that by 
using a small number of buffers you can achieve higher raw throughput. 
Avoiding the cache spoiling sounds plausible as well: if you're going to 
do a seq scan of a table larger than shared_buffers, you know those 
pages are not going to be needed in the near future; you're going to 
replace them yourself with pages from the same scan.



One could argue that the real bug here is that VACUUM deviates from the
standard behavior by forcing immediate recycling of pages it releases,
and that getting rid of that wart is the correct answer rather than
piling more warts atop it --- especially warts that will change the
behavior for anything besides VACUUM.


Maybe a better approach would be switching to an algorithm that's more 
resistant to sequential scans by nature. That's definitely not going to 
happen for 8.3, though.


It's also possible that whatever algorithm we choose doesn't make much 
difference in practice, because in typical configurations the OS 
cache is bigger and more significant anyway. It's also possible that 
there are counter-productive interactions between our cache management 
and the OS's.


In any case, I'd like to see more test results before we make a 
decision. I'm running tests with DBT-2 and a seq scan running in the 
background to see if the cache-spoiling effect shows up. I'm also trying 
to get hold of some bigger hardware to run on. Running these tests takes 
some calendar time, but the hard work has already been done. I'm going 
to start reviewing Jeff's synchronized scans patch now.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[PATCHES] Seq scans status update

2007-05-17 Thread Heikki Linnakangas
Attached is a new version of Simon's scan-resistant buffer manager 
patch. It's not ready for committing yet because of a small issue I 
found this morning (* see bottom), but here's a status update.


To recap, the basic idea is to use a small ring of buffers for large 
scans like VACUUM, COPY and seq-scans. Changes to the original patch:


- different sized rings are used for VACUUM, seq-scans, and COPY: VACUUM 
and seq-scans use a ring of 32 buffers, and COPY uses a ring of 4096 
buffers in the default configuration. See the README changes in the 
patch for the rationale.


- for queries with large seqscans, the buffer ring is only used for 
reads issued by the seq scan, not for any other reads in the query. 
Typical scenario where this matters is doing a large seq scan with a 
nested loop join to a smaller table. You don't want to use the buffer 
ring for index lookups inside the nested loop.


- for seqscans, drop buffers from the ring that would need a WAL flush 
to reuse. That makes bulk updates behave roughly like they do without 
the patch, instead of having to do a WAL flush every 32 pages.


I've spent a lot of time thinking of solutions to the last point. The 
obvious solution would be to not use the buffer ring for updating scans. 
The difficulty with that is that we don't know if a scan is read-only in 
heapam.c, where the hint to use the buffer ring is set.


I've completed a set of performance tests on a test server. The server 
has 4 GB of RAM, of which 1 GB is used for shared_buffers.


Results for a 10 GB table:

 head-copy-bigtable   | 00:10:09.07016
 head-copy-bigtable   | 00:10:20.507357
 head-copy-bigtable   | 00:10:21.857677
 head-copy_nowal-bigtable | 00:05:18.232956
 head-copy_nowal-bigtable | 00:03:24.109047
 head-copy_nowal-bigtable | 00:05:31.019643
 head-select-bigtable | 00:03:47.102731
 head-select-bigtable | 00:01:08.314719
 head-select-bigtable | 00:01:08.238509
 head-select-bigtable | 00:01:08.208563
 head-select-bigtable | 00:01:08.28347
 head-select-bigtable | 00:01:08.308671
 head-vacuum_clean-bigtable   | 00:01:04.227832
 head-vacuum_clean-bigtable   | 00:01:04.232258
 head-vacuum_clean-bigtable   | 00:01:04.294621
 head-vacuum_clean-bigtable   | 00:01:04.280677
 head-vacuum_hintbits-bigtable| 00:04:01.123924
 head-vacuum_hintbits-bigtable| 00:03:58.253175
 head-vacuum_hintbits-bigtable| 00:04:26.318159
 head-vacuum_hintbits-bigtable| 00:04:37.512965
 patched-copy-bigtable| 00:09:52.776754
 patched-copy-bigtable| 00:10:18.185826
 patched-copy-bigtable| 00:10:16.975482
 patched-copy_nowal-bigtable  | 00:03:14.882366
 patched-copy_nowal-bigtable  | 00:04:01.04648
 patched-copy_nowal-bigtable  | 00:03:56.062272
 patched-select-bigtable  | 00:03:47.704154
 patched-select-bigtable  | 00:01:08.460326
 patched-select-bigtable  | 00:01:10.441544
 patched-select-bigtable  | 00:01:11.916221
 patched-select-bigtable  | 00:01:13.848038
 patched-select-bigtable  | 00:01:10.956133
 patched-vacuum_clean-bigtable| 00:01:10.315439
 patched-vacuum_clean-bigtable| 00:01:12.210537
 patched-vacuum_clean-bigtable| 00:01:15.202114
 patched-vacuum_clean-bigtable| 00:01:10.712235
 patched-vacuum_hintbits-bigtable | 00:03:42.279201
 patched-vacuum_hintbits-bigtable | 00:04:02.057778
 patched-vacuum_hintbits-bigtable | 00:04:26.805822
 patched-vacuum_hintbits-bigtable | 00:04:28.911184

In other words, the patch has no significant effect, as expected. The 
select times did go up by a couple of seconds, which I didn't expect, 
though. One theory is that unused shared_buffers are swapped out during 
the tests, and bgwriter pulls them back in. I'll set swappiness to 0 and 
try again at some point.


Results for a 2 GB table:

 copy-medsize-unpatched| 00:02:18.23246
 copy-medsize-unpatched| 00:02:22.347194
 copy-medsize-unpatched| 00:02:23.875874
 copy_nowal-medsize-unpatched  | 00:01:27.606334
 copy_nowal-medsize-unpatched  | 00:01:17.491243
 copy_nowal-medsize-unpatched  | 00:01:31.902719
 select-medsize-unpatched  | 00:00:03.786031
 select-medsize-unpatched  | 00:00:02.678069
 select-medsize-unpatched  | 00:00:02.666103
 select-medsize-unpatched  | 00:00:02.673494
 select-medsize-unpatched  | 00:00:02.669645
 select-medsize-unpatched  | 00:00:02.666278
 vacuum_clean-medsize-unpatched| 00:00:01.091356
 vacuum_clean-medsize-unpatched| 00:00:01.923138
 vacuum_clean-medsize-unpatched| 00:00:01.917213
 vacuum_clean-medsize-unpatched| 00:00:01.917333
 vacuum_hintbits-medsize-unpatched | 00:00:01.683718
 vacuum_hintbits-medsize-unpatched | 00:00:01.864003
 vacuum_hintbits-medsize-unpatched | 00:00:03.186596
 

Re: [PATCHES] Seq scans status update

2007-05-17 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I've completed a set of performance tests on a test server. The server 
 has 4 GB of RAM, of which 1 GB is used for shared_buffers.

Perhaps I'm misreading it, but these tests seem to show no improvement
worth spending any effort on --- some of the tests improve a bit but
many get worse, and that's for tests allegedly designed to highlight the
improvement; there's been no attempt to measure the distributed cost of
the additional logic in scenarios where it's not helpful.  To say
nothing of the likelihood that it will be flat-out counterproductive
in yet other scenarios.

regression=# select id,sum(rt),count(*) from times group by id;
 id |   sum   | count 
+-+---
 10G| 01:15:53.497114 |20
 10Gpatched | 01:12:51.749906 |20
 2G | 00:11:54.343741 |20
 2Gpatched  | 00:11:32.482733 |20
(4 rows)

Should we not just reject the patch and move on to something more
useful?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PATCHES] Seq scans status update

2007-05-17 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
I've completed a set of performance tests on a test server. The server 
has 4 GB of RAM, of which 1 GB is used for shared_buffers.


Perhaps I'm misreading it, but these tests seem to show no improvement
worth spending any effort on --- some of the tests improve a bit but
many get worse, and that's for tests allegedly designed to highlight the
improvement; there's been no attempt to measure the distributed cost of
the additional logic in scenarios where it's not helpful.  To say
nothing of the likelihood that it will be flat-out counterproductive
in yet other scenarios.


Yeah, possibly. I've been thinking hard about scenarios where this would 
be counterproductive; bulk delete is one that the original patch hurt, 
but I think I have all the interesting scenarios covered now.



Should we not just reject the patch and move on to something more
useful?


If Luke is right, the L2 cache effect that's visible in these in-memory 
tests:


 select-medsize-patched| 00:00:01.332975
 select-medsize-patched| 00:00:01.33014
 select-medsize-patched| 00:00:01.332392
 select-medsize-patched| 00:00:01.333498
 select-medsize-patched| 00:00:01.332692
 select-medsize-unpatched  | 00:00:02.678069
 select-medsize-unpatched  | 00:00:02.666103
 select-medsize-unpatched  | 00:00:02.673494
 select-medsize-unpatched  | 00:00:02.669645
 select-medsize-unpatched  | 00:00:02.666278

is also visible on larger scans that don't fit in cache with bigger I/O 
hardware, and this patch would increase the max. I/O throughput that we 
can handle on such hardware. I don't have such hardware available; I 
hope someone else will try that.


In addition to that, this should reduce the cache-spoiling effect of big 
scans and COPY. That's harder to quantify; I've been planning to run a 
TPC-C test with a large SELECT COUNT(*) running in the background to 
measure that, but the server and I have been busy with those other tests.


In any case, I do want this for VACUUMs to fix the "WAL flush for every 
dirty page" problem. Maybe we should indeed drop the other aspects of 
the patch and move on; I'm getting tired of this as well.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PATCHES] Seq scans status update

2007-05-17 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 In any case, I do want this for VACUUMs to fix the "WAL flush for every 
 dirty page" problem. Maybe we should indeed drop the other aspects of 
 the patch and move on; I'm getting tired of this as well.

Can we devise a small patch that fixes that issue without creating
uncertain impacts on other behavior?

The thing that has made me uncomfortable with this set of patches right
along is the presumption that it's a good idea to tweak cache behavior
to improve a small set of cases.  We are using cache behavior (now
clock-sweep, formerly LRU) that is well tested and proven across a large
body of experience; my gut feeling is that deviating from that behavior
is likely to be a net loss when the big picture is considered.  I
certainly have not seen any test results that try to prove that there's
no such generalized disadvantage to these patches.

One could argue that the real bug here is that VACUUM deviates from the
standard behavior by forcing immediate recycling of pages it releases,
and that getting rid of that wart is the correct answer rather than
piling more warts atop it --- especially warts that will change the
behavior for anything besides VACUUM.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PATCHES] Seq scans status update

2007-05-17 Thread Luke Lonergan
Hi Heikki,

On 5/17/07 10:28 AM, Heikki Linnakangas [EMAIL PROTECTED] wrote:

 is also visible on larger scans that don't fit in cache with bigger I/O
 hardware, and this patch would increase the max. I/O throughput that we
 can handle on such hardware. I don't have such hardware available, I
 hope someone else will try that.

Yes, this is absolutely the case, in addition to the benefits of not
polluting the bufcache with seq scans (as discussed in detail previously).
We've adopted this (see CK's patch) with excellent benefits.

We can try your version on a machine with fast I/O and get back to you with
a comparison of this and CK's version.

- Luke



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings