Re: [PATCHES] Synchronized scans

2007-06-10 Thread Jeff Davis
On Sat, 2007-06-09 at 09:58 -0400, Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
> >  * For a large table, do lazy_scan_heap, scan_heap, and a sequential
> > scan usually progress at approximately the same rate?
> 
> scan_heap would probably be faster than a regular seqscan, since it
> isn't doing any where-clause-checking or data output.  Except if you've
> got vacuum-cost-limit enabled, which I think is likely to be true by
> default in future.  Another problem is that lazy_scan_heap stops every
> so often to make a pass over the table's indexes, which'd certainly
> cause it to fall out of sync with more typical seqscans.

I think that these problems are significant enough that I'm not sure
sync-scanning a VACUUM is the right way to approach the problem.

Maybe a better solution would be to try to get a sequential scan to do
some of the work required by a VACUUM. I don't think we can stop in the
middle of a sequential scan to vacuum the indexes, but perhaps we could
come up with some kind of scheme. It would be cheaper (perhaps) to spill
the list of deletable TIDs to disk than to rescan a big (mostly live)
table later. And if it was costly, we wouldn't need to do the scan part
of a VACUUM on every sequential scan.

I'm sure this has been brought up before, does someone have a pointer to
a discussion about doing VACUUM-like work in a sequential scan?

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-10 Thread Heikki Linnakangas

Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
> >  * Just adding in the syncscan to scan_heap and lazy_scan_heap seems
> > very easy at first thought. Are there any complications that I'm
> > missing?
> 
> I believe there are assumptions buried in both full and lazy vacuum that
> blocks are scanned in increasing order.  Not sure how hard that would be
> to fix or work around.  The only one I can specifically recall in lazy
> vacuum is that we assume the list of deletable TIDs is sorted a priori.
> Possibly you could deal with that by forcing an index-vacuum pass at the
> instant where the scan would wrap around, so that the list could be
> cleared before putting any lower-numbered blocks into it.


In this case, we're still scanning the table in increasing order; the 
zero-point is just shifted. We can still do a binary search if we do it 
in a wacky modulo-arithmetic fashion.
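A standalone toy of that search (just a sketch of the comparison trick, not
the PostgreSQL code): shift every key so the scan's start block maps to zero,
and an ordinary binary search works on the wrapped-around ordering.

#include <stdio.h>

typedef unsigned int BlockNumber;

/* Binary search in an array of block numbers sorted in scan order, where
 * the scan started at "startblk" in a table of "nblocks" blocks and
 * wrapped around.  Shifting each key by (nblocks - startblk) modulo
 * nblocks restores plain ascending order. */
static int
circular_bsearch(const BlockNumber *blocks, int n,
                 BlockNumber nblocks, BlockNumber startblk,
                 BlockNumber target)
{
    BlockNumber key = (target + nblocks - startblk) % nblocks;
    int         low = 0;
    int         high = n - 1;

    while (low <= high)
    {
        int         mid = (low + high) / 2;
        BlockNumber midkey = (blocks[mid] + nblocks - startblk) % nblocks;

        if (midkey == key)
            return mid;
        else if (midkey < key)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return -1;                  /* not found */
}

int
main(void)
{
    /* a scan of a 10-block table that started at block 7 */
    BlockNumber blocks[] = {7, 8, 9, 0, 2, 5};

    printf("%d\n", circular_bsearch(blocks, 6, 10, 7, 2));  /* prints 4 */
    return 0;
}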


I believe TID list ordering is the only reason why we need to scan in order.

I don't think sync-scanning vacuum is worth pursuing, though, because of 
the other issues: index scans, vacuum cost accounting, and the fact that 
the 2nd pass would be harder to synchronize. There are a lot of other 
interesting ideas for vacuum that are more generally applicable.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-10 Thread Gregory Stark
Heikki Linnakangas <[EMAIL PROTECTED]> writes:

> I don't think sync-scanning vacuum is worth pursuing, though, because of the
> other issues: index scans, vacuum cost accounting, and the fact that the 2nd
> pass would be harder to synchronize. There are a lot of other interesting
> ideas for vacuum that are more generally applicable.

I think we could probably arrange for vacuum to synchronize. If there's one
sequential scan running, we have to imagine there are others coming along soon
too, so if we become desynchronized we'll just coerce the next one to start
where we want and follow it for a while.

However, I have another worry. Even if we did manage to get vacuum
synchronizing well, what would it do to sequential scan performance?
Instead of zipping along reading clean blocks into its small ring buffer and
discarding them when it's done, the scan will suddenly find many of its blocks
dirty when it goes to reuse them. Effectively we'll have just reinvented the
problem we had with vacuum previously, albeit in a way which hits only
sequential scans, and hits them particularly hard.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Synchronized scans

2007-06-10 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> I'm sure this has been brought up before, does someone have a pointer to
> a discussion about doing VACUUM-like work in a sequential scan?

Yeah, it's been discussed before; try looking for "incremental vacuum"
and such phrases.

The main stumbling block is cleaning out index entries for the
known-dead heap tuple.  The current VACUUM design amortizes that cost
across as many dead heap tuples as it can manage; doing it retail seems
inevitably to be a lot more expensive.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-10 Thread Alvaro Herrera
Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
> > I'm sure this has been brought up before, does someone have a pointer to
> > a discussion about doing VACUUM-like work in a sequential scan?
> 
> Yeah, it's been discussed before; try looking for "incremental vacuum"
> and such phrases.
> 
> The main stumbling block is cleaning out index entries for the
> known-dead heap tuple.  The current VACUUM design amortizes that cost
> across as many dead heap tuples as it can manage; doing it retail seems
> inevitably to be a lot more expensive.

Maybe what we could do is have a seqscan save known-dead tuple IDs in a
file, and then in a different operation (initiated by autovacuum) we
would remove those TIDs from indexes, before the regular heap scan.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [PATCHES] Synchronized scans

2007-06-09 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
>  * For a large table, do lazy_scan_heap, scan_heap, and a sequential
> scan usually progress at approximately the same rate?

scan_heap would probably be faster than a regular seqscan, since it
isn't doing any where-clause-checking or data output.  Except if you've
got vacuum-cost-limit enabled, which I think is likely to be true by
default in future.  Another problem is that lazy_scan_heap stops every
so often to make a pass over the table's indexes, which'd certainly
cause it to fall out of sync with more typical seqscans.

The vacuum-cost-limit issue may be sufficient reason to kill this idea;
not sure.

  * Just adding in the syncscan to scan_heap and lazy_scan_heap seems
 very easy at first thought. Are there any complications that I'm
 missing?

I believe there are assumptions buried in both full and lazy vacuum that
blocks are scanned in increasing order.  Not sure how hard that would be
to fix or work around.  The only one I can specifically recall in lazy
vacuum is that we assume the list of deletable TIDs is sorted a priori.
Possibly you could deal with that by forcing an index-vacuum pass at the
instant where the scan would wrap around, so that the list could be
cleared before putting any lower-numbered blocks into it.
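A standalone toy of that wraparound rule (hypothetical; it was never
implemented, and scan_one_block and vacuum_indexes are stand-ins for the
real lazy-vacuum machinery):

#include <stdio.h>

#define NBLOCKS 10

static int num_dead;

static void scan_one_block(int blkno) { num_dead += blkno % 2; }
static void vacuum_indexes(void) { printf("index pass: %d TIDs\n", num_dead); num_dead = 0; }

int
main(void)
{
    int startblk = 7;           /* synchronized scan started mid-table */

    for (int scanned = 0, blkno = startblk; scanned < NBLOCKS; scanned++)
    {
        /* at the wrap point, flush the dead-TID list with an index pass
         * so it never mixes blocks from before and after the wrap and
         * therefore stays sorted */
        if (blkno == 0 && scanned > 0 && num_dead > 0)
            vacuum_indexes();
        scan_one_block(blkno);
        blkno = (blkno + 1) % NBLOCKS;
    }
    if (num_dead > 0)
        vacuum_indexes();       /* final pass, as usual */
    return 0;
}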

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-09 Thread Gregory Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> The vacuum-cost-limit issue may be sufficient reason to kill this idea;
> not sure.

We already have a much higher cost for blocks that cause i/o than blocks which
don't. I think if we had zero cost for blocks which don't cause i/o it would
basically work unless the sleep time was so large that the other scans managed
to cycle through the entire ring in that time.
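A standalone toy model of that accounting (a sketch only; the constants echo
the vacuum_cost_limit and vacuum_cost_page_miss settings, but none of this is
the real PostgreSQL code):

#include <stdio.h>
#include <stdbool.h>

static int cost_balance = 0;
static const int cost_limit = 200;      /* cf. vacuum_cost_limit */
static const int cost_page_miss = 10;   /* cf. vacuum_cost_page_miss */

/* Charge one scanned block; return true when the caller should sleep. */
static bool
charge_block(bool caused_io)
{
    if (caused_io)
        cost_balance += cost_page_miss; /* leader: we did the read */
    /* buffer hit: zero cost, per the proposal above */

    if (cost_balance >= cost_limit)
    {
        cost_balance = 0;
        return true;
    }
    return false;
}

int
main(void)
{
    int sleeps = 0;

    /* following the pack for 1000 blocks: all hits, never sleeps */
    for (int i = 0; i < 1000; i++)
        if (charge_block(false))
            sleeps++;
    printf("sleeps while following: %d\n", sleeps);   /* 0 */

    /* leading for 1000 blocks: one sleep per 20 misses */
    for (int i = 0; i < 1000; i++)
        if (charge_block(true))
            sleeps++;
    printf("sleeps after leading: %d\n", sleeps);     /* 50 */
    return 0;
}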

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Synchronized scans

2007-06-09 Thread Tom Lane
Gregory Stark <[EMAIL PROTECTED]> writes:
> Tom Lane <[EMAIL PROTECTED]> writes:
> > The vacuum-cost-limit issue may be sufficient reason to kill this idea;
> > not sure.
> 
> We already have a much higher cost for blocks that cause i/o than
> blocks which don't. I think if we had zero cost for blocks which don't
> cause i/o it would basically work unless the sleep time was so large
> that the other scans managed to cycle through the entire ring in that
> time.

I'm unconvinced.  That could perhaps work as long as the vacuum process
never did any I/O, ie was always a follower and never the leader of the
syncscan pack.  But if lazy_scan_heap is faster than a regular seqscan,
as I suspect, then it'd frequently become the leader and have to do the
I/O.  A few iterations of that will cause it to hit the vacuum cost
sleep, and that will make it fall behind the pack (and probably out of
sync entirely, unless the sleep is really short).

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > I fixed a little off-by-one in backward scan, "not inited" branch, but I 
> > was unable to test it. It seems that code is actually never used because 
> > that case is optimized to a rewind in the executor. I marked those 
> > seemingly unreachable places in the code with a comment.
> 
> Actually it's not equivalent to a rewind, it's more like the startup
> condition for an Index Scan Backward: start at the far end of the
> relation and go backwards.  I suspect that the whole thing may be
> unreachable code because the planner knows that seqscans are unordered
> and backward-scan is only of interest for an ordered scan.

Right. I was looking at the case where you open a cursor and start 
immediately scanning backwards. That's optimized to a rewind.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Heikki Linnakangas

Tom Lane wrote:
> Jeff Davis <[EMAIL PROTECTED]> writes:
> > Just to be sure: a backwards-started scan is currently unreachable code,
> > correct? 
> 
> [ yawn... ]  I think so, but I wouldn't swear to it right at the moment.
> In any case it doesn't seem like a path that we need to optimize.


Agreed, let's just disable the reporting when moving backwards. Jeff's 
idea of tainting the whole scan if you ever move backwards is not bad 
either, but it's so uncommon that we don't really need to care either way.


BTW: Should we do the synchronization in the non-page-at-a-time mode? 
It's not many lines of code to do so, but IIRC that codepath is only 
used for catalog access. System tables really shouldn't grow that big, 
and if they do we shouldn't be doing seq scans for them anyway. Does the 
unstable ordering confuse any catalog accesses?


Here's an update of the patch. I reverted the behavior at end of scan 
back to the way it was in Jeff's original patch, and disabled reporting 
the position when moving backwards.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/heap/Makefile
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/Makefile,v
retrieving revision 1.15
diff -c -r1.15 Makefile
*** src/backend/access/heap/Makefile	8 Apr 2007 01:26:27 -0000	1.15
--- src/backend/access/heap/Makefile	31 May 2007 10:21:28 -0000
***************
*** 12,18 ****
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o
  
  all: SUBSYS.o
  
--- 12,18 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o rewriteheap.o tuptoaster.o syncscan.o
  
  all: SUBSYS.o
  
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.234
diff -c -r1.234 heapam.c
*** src/backend/access/heap/heapam.c	30 May 2007 20:11:53 -0000	1.234
--- src/backend/access/heap/heapam.c	8 Jun 2007 09:53:12 -0000
***************
*** 85,104 ****
  
  	/*
  	 * If the table is large relative to NBuffers, use a bulk-read access
! 	 * strategy, else use the default random-access strategy.  During a
! 	 * rescan, don't make a new strategy object if we don't have to.
  	 */
  	if (scan->rs_nblocks > NBuffers / 4 &&
  		!scan->rs_rd->rd_istemp)
  	{
  		if (scan->rs_strategy == NULL)
  			scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
  	}
  	else
  	{
  		if (scan->rs_strategy != NULL)
  			FreeAccessStrategy(scan->rs_strategy);
  		scan->rs_strategy = NULL;
  	}
  
  	scan->rs_inited = false;
--- 85,108 ----
  
  	/*
  	 * If the table is large relative to NBuffers, use a bulk-read access
! 	 * strategy and enable synchronized scanning (see syncscan.c).
! 	 * During a rescan, don't make a new strategy object if we don't have to.
  	 */
  	if (scan->rs_nblocks > NBuffers / 4 &&
  		!scan->rs_rd->rd_istemp)
  	{
  		if (scan->rs_strategy == NULL)
  			scan->rs_strategy = GetAccessStrategy(BAS_BULKREAD);
+ 
+ 		scan->rs_startpage = ss_get_location(scan);
  	}
  	else
  	{
  		if (scan->rs_strategy != NULL)
  			FreeAccessStrategy(scan->rs_strategy);
  		scan->rs_strategy = NULL;
+ 
+ 		scan->rs_startpage = 0;
  	}
  
  	scan->rs_inited = false;
***************
*** 229,234 ****
--- 233,239 ----
  	Snapshot	snapshot = scan->rs_snapshot;
  	bool		backward = ScanDirectionIsBackward(dir);
  	BlockNumber page;
+ 	BlockNumber prevpage;
  	Page		dp;
  	int			lines;
  	OffsetNumber lineoff;
***************
*** 251,257 ****
  				tuple->t_data = NULL;
  				return;
  			}
! 			page = 0;			/* first page */
  			heapgetpage(scan, page);
  			lineoff = FirstOffsetNumber;		/* first offnum */
  			scan->rs_inited = true;
--- 256,262 ----
  				tuple->t_data = NULL;
  				return;
  			}
! 			page = scan->rs_startpage;			/* first page */
  			heapgetpage(scan, page);
  			lineoff = FirstOffsetNumber;		/* first offnum */
  			scan->rs_inited = true;
***************
*** 274,279 ****
--- 279,288 ----
  	}
  	else if (backward)
  	{
+ 		/* Note: This is not normally reached, because the planner knows
+ 		 * that seq scans are unordered by nature, and doesn't generate
+ 		 * plans with backward scans.
+ 		 */
  		if (!scan->rs_inited)
  		{
  			/*
***************
*** 285,291 ****
  				tuple->t_data = NULL;
  				return;
  			}
! 			page = scan->rs_nblocks - 1;		/* final page */
  			heapgetpage(scan, page);
  		}
  		else
--- 294,305 ----
  				tuple->t_data = NULL;
  				return;
  			}
! 			/* Start from the final page. We don't want to participate
! 			 * in synchronized scanning if we're scanning backwards, that
! 			 * would just step on the toes of any forward-scanners.
! 			 */
! 			page = scan->rs_startpage = scan->rs_nblocks - 1;
! 
  			heapgetpage(scan, page);
  		}
  		else
***************
*** 398,406 ****
 

Re: [PATCHES] Synchronized scans

2007-06-08 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
> > It occurs to me that there's an actual bug here for catalog access.
> > The code assumes that it can measure rs_nblocks only once and not worry
> > about tuples added beyond that endpoint.  But this is only true when
> > using an MVCC-safe snapshot.

> You would only miss tuples inserted after you began the seqscan. After 
> you've begun the scan, you're going to miss any tuples that are inserted 
> before the current position anyway, so stopping the scan early shouldn't 
> do any real harm.

Good point.

> It would only be a problem if you do something like:
> heap_beginscan(...)
> Lock
> while(heap_getnext) ...
> Unlock
> And expect to see all tuples inserted before acquiring the lock.

But that could be fixed by taking the lock before the heap_beginscan.
Indeed it's hard to conceive of a situation where you'd want/need to
take the lock afterward; in most cases the beginscan and the actual
scan are right together.

So I withdraw this complaint; it's complexity we don't need.  I'll
add a comment about the point though.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > BTW: Should we do the synchronization in the non-page-at-a-time mode? 
> > It's not many lines of code to do so, but IIRC that codepath is only 
> > used for catalog access. System tables really shouldn't grow that big, 
> > and if they do we shouldn't be doing seq scans for them anyway.
> 
> It occurs to me that there's an actual bug here for catalog access.
> The code assumes that it can measure rs_nblocks only once and not worry
> about tuples added beyond that endpoint.  But this is only true when
> using an MVCC-safe snapshot.  In SnapshotNow mode, in particular, this
> could easily lead to missing a valid tuple.  I'm not sure such a bug
> has ever been seen in the wild, because we don't do that many heapscans
> on system catalogs (and most of the ones we do do have some amount of
> higher-level locking preventing a concurrent update).  But we ought to
> fix it while we're touching the code now.  That will take re-measuring
> rs_nblocks each time we consider stopping, and I think we definitely
> will NOT want the syncscan code getting involved.


You would only miss tuples inserted after you began the seqscan. After 
you've begun the scan, you're going to miss any tuples that are inserted 
before the current position anyway, so stopping the scan early shouldn't 
do any real harm. It would only be a problem if you do something like:

heap_beginscan(...)
Lock
while(heap_getnext) ...
Unlock

And expect to see all tuples inserted before acquiring the lock.

Good catch, though. Fixing it is probably a good idea even if it's not a 
problem for any existing code.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
> > Jeff Davis <[EMAIL PROTECTED]> writes:
> > > Just to be sure: a backwards-started scan is currently unreachable code,
> > > correct? 
> > 
> > [ yawn... ]  I think so, but I wouldn't swear to it right at the moment.
> > In any case it doesn't seem like a path that we need to optimize.
> 
> Agreed, let's just disable the reporting when moving backwards.

Now that I'm awake, it is reachable code, per this comment:

 * Note: when we fall off the end of the scan in either direction, we
 * reset rs_inited.  This means that a further request with the same
 * scan direction will restart the scan, which is a bit odd, but a
 * request with the opposite scan direction will start a fresh scan
 * in the proper direction.  The latter is required behavior for cursors,
 * while the former case is generally undefined behavior in Postgres
 * so we don't care too much.

That is, if you run a cursor to the end and then back up, you'll go
through the init-in-the-backwards-direction code.

But we're agreed that we don't want to report when moving backwards,
so this is just an interesting note...

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Jeff Davis
On Fri, 2007-06-08 at 11:05 +0100, Heikki Linnakangas wrote:
> BTW: Should we do the synchronization in the non-page-at-a-time mode? 
> It's not many lines of code to do so, but IIRC that codepath is only 
> used for catalog access. System tables really shouldn't grow that big, 
> and if they do we shouldn't be doing seq scans for them anyway. Does the 
> unstable ordering confuse any catalog accesses?

http://archives.postgresql.org/pgsql-hackers/2006-09/msg01199.php

There is a very minor assumption there that scans on pg_class will
return in the same order. I'm not sure if that's even a problem.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-08 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> On Fri, 2007-06-08 at 11:05 +0100, Heikki Linnakangas wrote:
> > BTW: Should we do the synchronization in the non-page-at-a-time mode? 
> 
> http://archives.postgresql.org/pgsql-hackers/2006-09/msg01199.php
> 
> There is a very minor assumption there that scans on pg_class will
> return in the same order. I'm not sure if that's even a problem.

I'm inclined not to worry about that --- the odds of trouble from a
concurrent schema update are at least as large as the odds of trouble
from syncscan-induced variance.  The correct fix would be to make ANALYZE
sort the rels by OID, anyhow, not to dumb down the scan.
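That fix amounts to a qsort over the collected OIDs before processing; a
minimal standalone sketch:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;

/* Order relations by OID so the result doesn't depend on the physical
 * order a (possibly syncscan-shifted) pg_class seqscan returned them in. */
static int
oid_cmp(const void *a, const void *b)
{
    Oid oa = *(const Oid *) a;
    Oid ob = *(const Oid *) b;

    return (oa > ob) - (oa < ob);
}

int
main(void)
{
    /* OIDs in whatever order the pg_class scan produced them */
    Oid rels[] = {16391, 16384, 16388};

    qsort(rels, 3, sizeof(Oid), oid_cmp);
    for (int i = 0; i < 3; i++)
        printf("%u\n", rels[i]);        /* 16384 16388 16391 */
    return 0;
}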

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Jeff Davis
On Fri, 2007-06-08 at 12:22 -0400, Tom Lane wrote:
> Now that I'm awake, it is reachable code, per this comment:
> 
>  * Note: when we fall off the end of the scan in either direction, we
>  * reset rs_inited.  This means that a further request with the same
>  * scan direction will restart the scan, which is a bit odd, but a

I'm confused about this part of the comment.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-08 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> Here's an update of the patch. I reverted the behavior at end of scan 
> back to the way it was in Jeff's original patch, and disabled reporting 
> the position when moving backwards.

Applied with minor editorializations --- notably, I got rid of the
HeapScanDesc dependency in syncscan.c's API, so that it could be used
in other contexts (VACUUM, anyone?).  There were a few glitches in the
heapam.c code too.

> I didn't touch the large scan threshold of NBuffers / 4 that Tom 
> committed as part of the buffer ring patch. IOW I removed the GUC 
> variable from the patch. I think the jury is still out on this one.

Yeah, this could do with more testing.  I concur with the idea that the
threshold should be the same for both bits of code, though.  Otherwise
we have four behaviors to try to tune, instead of two.

> I included a basic regression test as well.

I did not commit this, as it seemed a bit useless --- it's looking for a
minor side-effect and not doing much of anything to prove that the code
does what's intended.  Possibly we could put in a real test after
Greg's concurrent-psql thing is in.

Jeff wrote:
> I might go so far as to suggest if the scan *ever* moves backwards, we
> taint the scan such that it never reports.

This would be a trivial addition to the code-as-committed (clear
rs_syncscan upon backing up by a page) but I didn't add it.  Any
strong feelings one way or the other?  AFAIK the only case where
it'd happen is if someone reads forwards in a large-seqscan cursor
for awhile and then reads backwards.  You could argue that doing
that is a good reason to drop them out of the syncscan pack ...
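A standalone toy of that trivial addition (rs_syncscan is the flag from the
code-as-committed; everything else here is scaffolding invented for the
sketch):

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    bool rs_syncscan;           /* still reporting our position? */
    long rs_cblock;             /* current block */
} ToyScanDesc;

/* Step the scan one page; any backward step taints the scan so it never
 * reports its position again. */
static void
step(ToyScanDesc *scan, bool backward)
{
    if (backward)
    {
        scan->rs_syncscan = false;      /* taint: leave the pack for good */
        scan->rs_cblock--;
    }
    else
    {
        scan->rs_cblock++;
        if (scan->rs_syncscan)
            printf("ss_report_location(%ld)\n", scan->rs_cblock);
    }
}

int
main(void)
{
    ToyScanDesc scan = {true, 0};

    step(&scan, false);         /* reports block 1 */
    step(&scan, true);          /* backs up: taints the scan */
    step(&scan, false);         /* no longer reports */
    return 0;
}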

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> On Fri, 2007-06-08 at 12:22 -0400, Tom Lane wrote:
> > Now that I'm awake, it is reachable code, per this comment:
> > 
> > * Note: when we fall off the end of the scan in either direction, we
> > * reset rs_inited.  This means that a further request with the same
> > * scan direction will restart the scan, which is a bit odd, but a
> 
> I'm confused about this part of the comment.

If you tell the SeqScan plan node to fetch forward over and over, it
eventually will return NULL indicating end-of-data.  If you then demand
*another* forward fetch, it'll start the whole scan over from scratch,
because of the fact that it sees rs_inited clear.  You might've expected
another NULL but that's not what you'll get.

In general, however, the behavior of a plan node in this scenario is
not defined and not relied on by anything --- some of the join node
types will crash outright if you try to fetch past EOF, IIRC.  The case
that *is* required is to be able to scan backwards after returning NULL.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-08 Thread Jeff Davis
On Fri, 2007-06-08 at 11:57 -0700, Jeff Davis wrote:
> On Fri, 2007-06-08 at 14:36 -0400, Tom Lane wrote:
> > Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > > Here's an update of the patch. I reverted the behavior at end of scan 
> > > back to the way it was in Jeff's original patch, and disabled reporting 
> > > the position when moving backwards.
> > 
> > Applied with minor editorializations --- notably, I got rid of the
> > HeapScanDesc dependency in syncscan.c's API, so that it could be used
> > in other contexts (VACUUM, anyone?).  There were a few glitches in the
> > heapam.c code too.
> 
> I think VACUUM would be an ideal place for it. I assume we don't want to

I have a few thoughts:

 * For a large table, do lazy_scan_heap, scan_heap, and a sequential
scan usually progress at approximately the same rate?

 * Are there any other parts of the vacuum process that may benefit?

 * Just adding in the syncscan to scan_heap and lazy_scan_heap seems
very easy at first thought. Are there any complications that I'm
missing?

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-07 Thread Jeff Davis
On Thu, 2007-06-07 at 22:52 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > I fixed a little off-by-one in backward scan, "not inited" branch, but I 
> > was unable to test it. It seems that code is actually never used because 
> > that case is optimized to a rewind in the executor. I marked those 
> > seemingly unreachable places in the code with a comment.
> 
> Actually it's not equivalent to a rewind, it's more like the startup
> condition for an Index Scan Backward: start at the far end of the
> relation and go backwards.  I suspect that the whole thing may be
> unreachable code because the planner knows that seqscans are unordered
> and backward-scan is only of interest for an ordered scan.  But be that
> as it may: do we even want a backwards-running scan to participate in a
> syncscan group?  Unless *all* the backends are doing the same thing,
> it will not help and instead will bollix the syncscan for everyone else.
> I'm inclined to disable use of syncscan.c altogether when the scan is

Just to be sure: a backwards-started scan is currently unreachable code,
correct? 

But as long as the code is there (reachable or not) it sounds good to
disable sync scan in that case.

> started backwards.  It also seems prudent to suppress ss_report_location
> calls when stepping backward in a generally-forwards scan.  Thoughts?

I agree that we should disable ss_report_location if the scan is moving
backwards.

I might go so far as to suggest if the scan *ever* moves backwards, we
taint the scan such that it never reports.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-07 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> Just to be sure: a backwards-started scan is currently unreachable code,
> correct? 

[ yawn... ]  I think so, but I wouldn't swear to it right at the moment.
In any case it doesn't seem like a path that we need to optimize.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-05 Thread Heikki Linnakangas

Tom Lane wrote:

> But note that barring backend crash, once all the scans are done it is
> guaranteed that the hint will be removed --- somebody will be last to
> update the hint, and therefore will remove it when they do heap_endscan,
> even if others are not quite done.  This is good in the sense that
> later-starting backends won't be fooled into starting at what is
> guaranteed to be the most pessimal spot, but it's got a downside too,
> which is that there will be windows where seqscans are in process but
> a newly started scan won't see them.  Maybe that's a killer objection.


I think the way the patch is now is better than trying to remove the 
hints, but I don't feel strongly either way.


However, I don't think we should try hard to mask the issue. It just 
means people are more likely to miss it in testing, and run into it in 
production. It's better to find out sooner than later.


It might be a good idea to preserve the order within a transaction, 
though that means more code.



> When exactly is the hint updated?  I gathered from something Heikki said
> that it's set after processing X amount of data, but I think it might be
> better to set it *before* processing X amount of data.  That is, the
> hint means "I'm going to be scanning at least threshold blocks
> starting here", not "I have scanned threshold blocks ending here",
> which seems like the interpretation that's being used at the moment.
> What that would mean is that successive LIMIT 1000 calls would in fact
> all start at the same place, barring interference from other backends.


I don't see how it makes any difference whether you update the hint 
before or after processing. Running a LIMIT 1000 query repeatedly will 
start from the same place in any case, assuming 1000 tuples fit in the 
report interval, which is 128KB currently.
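For reference, with the default 8KB block size that interval works out to 16
pages; the constant in syncscan.c is defined along these lines (sketch):

#include <stdio.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */
#define SYNC_SCAN_REPORT_INTERVAL (128 * 1024 / BLCKSZ)

int
main(void)
{
    /* 128KB / 8KB = 16 pages between location reports */
    printf("report every %d pages\n", SYNC_SCAN_REPORT_INTERVAL);
    return 0;
}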


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-05 Thread Jeff Davis
On Mon, 2007-06-04 at 21:39 -0400, Tom Lane wrote:
> idea of deleting the hint.  But if we could change the hint behavior to
> say "start reading here", successive short LIMITed reads would all start
> reading from the same point, which fixes both my reproducibility concern
> and Heikki's original point about being able to re-use cached data.
> You'd only get into the irreproducible behavior if the LIMIT was larger
> than the amount of data scanned between hint updates.

That's how it works now. Small limit queries don't change the location
in the hint, so if you repeat them, the queries keep starting from the
same place, and fetching the same tuples.

Large LIMIT queries (larger than the number of pages between updates) do
change the location in the hint, and so that's really the case you're
worried about.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-05 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> That's how it works now. Small limit queries don't change the location
> in the hint, so if you repeat them, the queries keep starting from the
> same place, and fetching the same tuples.

OK, maybe the problem's not as severe as I thought then.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-05 Thread Jeff Davis
On Mon, 2007-06-04 at 10:53 +0100, Heikki Linnakangas wrote:
> I'm now done with this patch and testing it.
 

One difference between our patches is that, in my patch, the scan's
ending condition is checked only after the hint has been set back to the
starting position. 

That means, in my patch, if you do:
SELECT * FROM bigtable;
SELECT * FROM bigtable;

with no concurrency at all, the returned order will be the same.

In your patch, each full table scan leaves the hint at 16 pages before
the position it started in, leading to different orders on the full
table scans.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> For the record, this patch has a small negative impact on scans like 
> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> each scan will start from roughly where the previous one stopped, requiring 
> more pages to be read from disk each time. I don't think it's something 
> to worry about in practice, but I thought I'd mention it.

Urgh.  The answers change depending on (more or less) the phase of the
moon?  I've got a serious problem with that.  You might look back to
1997 when GEQO very nearly got tossed out entirely because it destroyed
reproducibility of query results.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > For the record, this patch has a small negative impact on scans like 
> > "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> > HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> > each scan will start from roughly where the previous one stopped, requiring 
> > more pages to be read from disk each time. I don't think it's something 
> > to worry about in practice, but I thought I'd mention it.
> 
> Urgh.  The answers change depending on (more or less) the phase of the
> moon?  I've got a serious problem with that.  You might look back to
> 1997 when GEQO very nearly got tossed out entirely because it destroyed
> reproducibility of query results.


That's a very fundamental result of this patch, unfortunately. It only 
happens on scans on tables larger than the threshold. And because we 
only report the current scan location every 128KB, if you repeat the 
same SELECT .. LIMIT X query with no other scanners on that table, 
you'll get the same results as long as X is smaller than 128KB.


I thought we've been through this issue already...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Bruce Momjian
Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > > For the record, this patch has a small negative impact on scans like 
> > > "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> > > HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> > > each scan will start from roughly where the previous one stopped, requiring 
> > > more pages to be read from disk each time. I don't think it's something 
> > > to worry about in practice, but I thought I'd mention it.
> > 
> > Urgh.  The answers change depending on (more or less) the phase of the
> > moon?  I've got a serious problem with that.  You might look back to
> > 1997 when GEQO very nearly got tossed out entirely because it destroyed
> > reproducibility of query results.
> 
> That's a very fundamental result of this patch, unfortunately. It only 
> happens on scans on tables larger than the threshold. And because we 
> only report the current scan location every 128KB, if you repeat the 
> same SELECT .. LIMIT X query with no other scanners on that table, 
> you'll get the same results as long as X is smaller than 128KB.
> 
> I thought we've been through this issue already...

Agreed.  I thought we always said that a LIMIT without an ORDER BY was
meaningless, particularly because an intervening UPDATE could have moved
rows to another place in the table.  In fact, at one time we considered
preventing LIMIT without ORDER BY because it was meaningless, but decided
if people want unstable results, they should be able to get them.

An argument could be made that a LIMIT without ORDER BY on a table
locked read-only should be stable.

As I understand it, the problem is that while currently LIMIT without
ORDER BY always starts at the beginning of the table, it will not with
this patch.  I consider that acceptable.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> As I understand it, the problem is that while currently LIMIT without
> ORDER BY always starts at the beginning of the table, it will not with
> this patch.  I consider that acceptable.

It's definitely going to require stronger warnings than we have now
about using LIMIT without ORDER BY.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > As I understand it, the problem is that while currently LIMIT without
> > ORDER BY always starts at the beginning of the table, it will not with
> > this patch.  I consider that acceptable.
> 
> It's definitely going to require stronger warnings than we have now
> about using LIMIT without ORDER BY.


Along the lines of

NOTICE: LIMIT without ORDER BY returns an arbitrary set of matching rows

perhaps? I wonder how easy it is to detect that in the planner.

Or just a remark in the manual?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 10:53 +0100, Heikki Linnakangas wrote:
> I'm now done with this patch and testing it.

Great!

> For the record, this patch has a small negative impact on scans like 
> "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> each scan will start from roughly where the previous one stopped, requiring 
> more pages to be read from disk each time. I don't think it's something 
> to worry about in practice, but I thought I'd mention it.
 

No surprise here, as you and Bruce have already pointed out.

If we wanted to reduce the occurrence of this phenomenon, we could
perhaps time out the hints so that it's impossible to pick up a hint
from a scan that finished 5 minutes ago.

It doesn't seem helpful to further obscure the non-determinism of the
results, however.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Alvaro Herrera
Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > For the record, this patch has a small negative impact on scans like 
> > "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> > HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> > each scan will start from roughly where the previous one stopped, requiring 
> > more pages to be read from disk each time. I don't think it's something 
> > to worry about in practice, but I thought I'd mention it.
> 
> Urgh.  The answers change depending on (more or less) the phase of the
> moon?  I've got a serious problem with that.  You might look back to
> 1997 when GEQO very nearly got tossed out entirely because it destroyed
> reproducibility of query results.

What about the simple idea of just disabling the use of a sync scan when
the query has LIMIT and no ORDER BY, and start always at block 0 in that
case?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Alvaro Herrera wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > > For the record, this patch has a small negative impact on scans like 
> > > "SELECT * FROM foo LIMIT 1000". If such a scan is run repeatedly, in CVS 
> > > HEAD the first 1000 rows will stay in buffer cache, but with the patch 
> > > each scan will start from roughly where the previous one stopped, requiring 
> > > more pages to be read from disk each time. I don't think it's something 
> > > to worry about in practice, but I thought I'd mention it.
> > 
> > Urgh.  The answers change depending on (more or less) the phase of the
> > moon?  I've got a serious problem with that.  You might look back to
> > 1997 when GEQO very nearly got tossed out entirely because it destroyed
> > reproducibility of query results.
> 
> What about the simple idea of just disabling the use of a sync scan when
> the query has LIMIT and no ORDER BY, and start always at block 0 in that
> case?


That handles the LIMIT case, but you would still observe the different 
ordering. And some people do LIMIT-like behavior on the client side, by 
opening a cursor and only fetching the first n rows.


I don't think anyone can reasonably expect to get the same ordering when 
the same query is issued twice in general, but within the same transaction 
it wouldn't be that unreasonable. If we care about that, we could keep 
track of starting locations per transaction, only do the synchronization 
on the first scan in a transaction, and start subsequent scans from the 
same page as the first one. That way if you issue the same query twice 
in a transaction, or do something like:

BEGIN;
SELECT * FROM queue LIMIT 10 FOR UPDATE;
-- do stuff...
DELETE FROM queue WHERE ...;  -- delete the ten rows fetched above
COMMIT;

you'd get the expected result.
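A standalone toy of that per-transaction bookkeeping (hypothetical, never
implemented; all names invented): remember the start page chosen the first
time each relation is seq-scanned in a transaction and reuse it afterwards.

#include <stdio.h>

#define MAX_RELS 8

typedef unsigned int Oid;

static Oid  rel_ids[MAX_RELS];
static long rel_start[MAX_RELS];
static int  nrels;              /* reset at transaction end */

/* First scan of a relation in this transaction synchronizes with the
 * pack; later scans reuse the remembered start page. */
static long
get_scan_start(Oid relid, long syncscan_hint)
{
    for (int i = 0; i < nrels; i++)
        if (rel_ids[i] == relid)
            return rel_start[i];

    if (nrels < MAX_RELS)
    {
        rel_ids[nrels] = relid;
        rel_start[nrels++] = syncscan_hint;
    }
    return syncscan_hint;
}

int
main(void)
{
    printf("%ld\n", get_scan_start(16384, 100));  /* first scan: 100 */
    printf("%ld\n", get_scan_start(16384, 250));  /* repeat: still 100 */
    return 0;
}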

I think the warning on LIMIT without ORDER BY is a good idea, regardless 
of the synchronized scans patch.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Jeff Davis wrote:
> No surprise here, as you and Bruce have already pointed out.
> 
> If we wanted to reduce the occurrence of this phenomenon, we could
> perhaps time out the hints so that it's impossible to pick up a hint
> from a scan that finished 5 minutes ago.
> 
> It doesn't seem helpful to further obscure the non-determinism of the
> results, however.


I agree it's probably not a good idea to try masking this. It'll just 
make it harder to hit the issue in testing, so that you run into it in 
production.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> I don't think anyone can reasonably expect to get the same ordering when 
> the same query is issued twice in general, but within the same transaction 
> it wouldn't be that unreasonable. If we care about that, we could keep 
> track of starting locations per transaction, only do the synchronization 
> on the first scan in a transaction, and start subsequent scans from the 
> same page as the first one.

I think the real problem here is that the first scan is leaving state
behind that changes the behavior of the next scan.  Which can have no
positive benefit, since obviously the first scan is not still
proceeding; the best you can hope for is that it's a no-op and the worst
case is that it actively pessimizes things.  Why doesn't the patch
remove the shmem entry at scan termination?

> I think the warning on LIMIT without ORDER BY is a good idea, regardless 
> of the synchronized scans patch.

I seriously doubt that can be done in any way that doesn't both warn
about perfectly-safe cases and fail to warn about other unsafe ones.
Furthermore, it's not uncommon for people to do SELECT * ... LIMIT 1
just to remind themselves of column names or whatever.  Do we really
want the thing to be so nannyish?  I was envisioning simply a stronger
warning in the SELECT reference page ...

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > I don't think anyone can reasonably expect to get the same ordering when 
> > the same query is issued twice in general, but within the same transaction 
> > it wouldn't be that unreasonable. If we care about that, we could keep 
> > track of starting locations per transaction, only do the synchronization 
> > on the first scan in a transaction, and start subsequent scans from the 
> > same page as the first one.
> 
> I think the real problem here is that the first scan is leaving state
> behind that changes the behavior of the next scan.  Which can have no
> positive benefit, since obviously the first scan is not still
> proceeding; the best you can hope for is that it's a no-op and the worst
> case is that it actively pessimizes things.  Why doesn't the patch
> remove the shmem entry at scan termination?


Because there's no reason why it should, and it would require a lot more 
bookkeeping. There can be many scanners on the same table, so we'd need 
to implement some kind of reference counting, which means having to 
reliably decrement the counter when a scan terminates.


In any case if there actually is a concurrent scan, you'd still see 
different ordering. Removing the entry when a scan is over would just 
make it harder to trigger.


> I think the warning on LIMIT without ORDER BY is a good idea, regardless 
> of the synchronized scans patch.


> I seriously doubt that can be done in any way that doesn't both warn
> about perfectly-safe cases and fail to warn about other unsafe ones.
> Furthermore, it's not uncommon for people to do SELECT * ... LIMIT 1
> just to remind themselves of column names or whatever.  Do we really
> want the thing to be so nannyish?  


It really depends on how many false negatives and positives it gives. If 
too many, it's just annoying, but if it's reasonably accurate I think it 
would be OK to remind people running queries like that.



> I was envisioning simply a stronger warning in the SELECT reference page ...


I doubt the people that would be bitten by this read the SELECT 
reference page.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Michael Glaesemann


On Jun 4, 2007, at 15:24 , Heikki Linnakangas wrote:
> I don't think anyone can reasonably expect to get the same ordering 
> when the same query is issued twice in general, but within the same 
> transaction it wouldn't be that unreasonable.

The order in which rows are returned without an ORDER BY clause *is* 
implementation dependent, and is not guaranteed, at least by the 
spec. Granted, LIMIT without ORDER BY (and DISTINCT for that matter) 
brings this into sharp relief.


> I think the warning on LIMIT without ORDER BY is a good idea, 
> regardless of the synchronized scans patch.

I'm not saying this isn't a good idea, but are there other places 
where there might be gotchas for the unwary, such as DISTINCT without 
ORDER BY or (for an unrelated example) UNION versus UNION ALL? How 
many of these types of messages would be useful?


Michael Glaesemann
grzm seespotcode net





Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 16:42 -0400, Tom Lane wrote:
> Heikki Linnakangas <[EMAIL PROTECTED]> writes:
> > I don't think anyone can reasonably expect to get the same ordering when 
> > the same query is issued twice in general, but within the same transaction 
> > it wouldn't be that unreasonable. If we care about that, we could keep 
> > track of starting locations per transaction, only do the synchronization 
> > on the first scan in a transaction, and start subsequent scans from the 
> > same page as the first one.
> 
> I think the real problem here is that the first scan is leaving state
> behind that changes the behavior of the next scan.  Which can have no
> positive benefit, since obviously the first scan is not still
> proceeding; the best you can hope for is that it's a no-op and the worst
> case is that it actively pessimizes things.  Why doesn't the patch
> remove the shmem entry at scan termination?
 

Sounds like a reasonable idea to me. We could add the PID to the data
structure so that it would only remove the hint if it's the one that set
the hint. I think we'd just need to call a function to do that from
heap_endscan(), correct?

However, we couldn't 100% guarantee that the state would be cleared. A
backend could be killed in the middle of a scan.

The case you're worried about seems very narrow to me, and I think
"actively pessimizes" is too strong. Unless I misunderstand, the best
you can hope for (the no-op) happens in all cases except a most bizarre one:
that in which you're executing repeated identical LIMIT queries with no
ORDER BY; and the tuples returned occupy more than 128K (16 pages is the
reporting period) but few enough that caching them would be effective; and
the table in question is larger than the large-table threshold. I'm just
trying to add some perspective about what we're fixing, here.

But it's fair to say that a scan should clear any state when it's done.

Regards,
Jeff Davis






Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 22:09 +0100, Heikki Linnakangas wrote:
> > I think the real problem here is that the first scan is leaving state
> > behind that changes the behavior of the next scan.  Which can have no
> > positive benefit, since obviously the first scan is not still
> > proceeding; the best you can hope for is that it's a no-op and the worst
> > case is that it actively pessimizes things.  Why doesn't the patch
> > remove the shmem entry at scan termination?
> 
> Because there's no reason why it should, and it would require a lot more 
> bookkeeping. There can be many scanners on the same table, so we'd need 
> to implement some kind of reference counting, which means having to 
> reliably decrement the counter when a scan terminates.
 

That's what I thought at first, and why I didn't do it. Right now I'm
thinking we could just add the PID to the hint, so that it would only
remove its own hint. Would that work?
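A standalone toy of the PID-tagged hint (hypothetical; the struct fields and
function names are invented for this sketch, not the syncscan.c API):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

typedef unsigned int Oid;
typedef unsigned int BlockNumber;

typedef struct
{
    Oid         relfilenode;    /* which relation the hint is for */
    pid_t       pid;            /* backend that last updated the hint */
    BlockNumber location;       /* last reported block */
} ToyScanHint;

static ToyScanHint hint;

static void
report_location(Oid rel, BlockNumber blk)
{
    hint.relfilenode = rel;
    hint.location = blk;
    hint.pid = getpid();        /* stamp every update with our PID */
}

/* Called from heap_endscan(): drop the hint only if we were the last
 * backend to touch it, so concurrent scanners keep theirs. */
static void
remove_location(Oid rel)
{
    if (hint.relfilenode == rel && hint.pid == getpid())
        hint.relfilenode = 0;   /* "remove" the entry */
}

int
main(void)
{
    report_location(16384, 42);
    remove_location(16384);
    printf("hint cleared: %s\n", hint.relfilenode == 0 ? "yes" : "no");
    return 0;
}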

It's still vulnerable to a backend being killed and the hint hanging
around. However, the next scan would clear it and get things back to
normal. Reference counting would cause weirdness if a backend died or
something, because the count would never decrement to 0.

> In any case if there actually is a concurrent scan, you'd still see 
> different ordering. Removing the entry when a scan is over would just 
> make it harder to trigger.
 

Agreed. I don't know for sure whether that's good or bad, but it would
make the nondeterminism less immediately visible.

Regards,
Jeff Davis




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Michael Glaesemann


On Jun 4, 2007, at 16:34 , Heikki Linnakangas wrote:
> LIMIT without ORDER BY is worse because it not only returns tuples 
> in different order, but it can return different tuples altogether 
> when you run it multiple times.


Wouldn't DISTINCT ON suffer from the same issue without ORDER BY?

Michael Glaesemann
grzm seespotcode net





Re: [PATCHES] Synchronized scans

2007-06-04 Thread Heikki Linnakangas

Jeff Davis wrote:
> On Mon, 2007-06-04 at 22:09 +0100, Heikki Linnakangas wrote:
> > > I think the real problem here is that the first scan is leaving state
> > > behind that changes the behavior of the next scan.  Which can have no
> > > positive benefit, since obviously the first scan is not still
> > > proceeding; the best you can hope for is that it's a no-op and the worst
> > > case is that it actively pessimizes things.  Why doesn't the patch
> > > remove the shmem entry at scan termination?
> > 
> > Because there's no reason why it should, and it would require a lot more 
> > bookkeeping. There can be many scanners on the same table, so we'd need 
> > to implement some kind of reference counting, which means having to 
> > reliably decrement the counter when a scan terminates.
> 
> That's what I thought at first, and why I didn't do it. Right now I'm
> thinking we could just add the PID to the hint, so that it would only
> remove its own hint. Would that work?


Were you thinking of storing the PID of the backend that originally 
created the hint, or updating the PID every time the hint is updated? In 
any case, we still wouldn't know if there are other scanners still running.


We could just always remove the hint when a scan ends, and rely on the 
fact that if there's other scans still running they will put the hint 
back very quickly. There would then be a small window where there's no 
hint but a scan is active, and a new scan starting during that window 
would fail to synchronize with the other scanners.



> It's still vulnerable to a backend being killed and the hint hanging
> around. However, the next scan would clear it and get things back to normal.


Oh, did you mean that the PID would be updated whenever a new scan 
starts? So that the PID stored would always be the PID of the latest 
scanner. That might work pretty well, though a small scan with a LIMIT, 
or any other situation where some scans run faster than others, might 
clear the hint prematurely while other scans are still running.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> My thought was that every time the location was reported by a backend,
> it would store 3 pieces of information, not 2:
>  * relfilenode
>  * the PID of the backend that created or updated this particular hint
> last
>  * the location
> 
> Then, on heap_endscan() (if that's the right place), we find the hint,
> and if the PID matches, we remove it. If not, it does nothing.
> 
> This would only matter when there weren't other scans. When concurrent
> scans were happening, chances are the PID wouldn't match anyway, and
> thus not be removed.

But note that barring backend crash, once all the scans are done it is
guaranteed that the hint will be removed --- somebody will be last to
update the hint, and therefore will remove it when they do heap_endscan,
even if others are not quite done.  This is good in the sense that
later-starting backends won't be fooled into starting at what is
guaranteed to be the most pessimal spot, but it's got a downside too,
which is that there will be windows where seqscans are in process but
a newly started scan won't see them.  Maybe that's a killer objection.

When exactly is the hint updated?  I gathered from something Heikki said
that it's set after processing X amount of data, but I think it might be
better to set it *before* processing X amount of data.  That is, the
hint means "I'm going to be scanning at least threshold blocks
starting here", not "I have scanned threshold blocks ending here",
which seems like the interpretation that's being used at the moment.
What that would mean is that successive LIMIT 1000 calls would in fact
all start at the same place, barring interference from other backends.

regards, tom lane



Re: [PATCHES] Synchronized scans

2007-06-04 Thread Gregory Stark

Heikki Linnakangas <[EMAIL PROTECTED]> writes:

> Were you thinking of storing the PID of the backend that originally created 
> the hint, or updating the PID every time the hint is updated? In any case, 
> we still wouldn't know if there are other scanners still running.

My reaction was that if you always put the PID in when updating the hint, it
might work out pretty well. When you finish, you remove the hint iff the PID
is your own. 

It has exactly the same properties you describe of leaving a small window with
no hint when you finish a scan, but only if you were the last to update the
scan position, which I would expect would be pretty rare except for the last
scanner.

If a backend died, it would leave a scan position behind, but the next scanner
on that table would overwrite the PID and then remove it when it's finished.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PATCHES] Synchronized scans

2007-06-04 Thread Jeff Davis
On Mon, 2007-06-04 at 18:25 -0400, Tom Lane wrote:
> But note that barring backend crash, once all the scans are done it is
> guaranteed that the hint will be removed --- somebody will be last to
> update the hint, and therefore will remove it when they do heap_endscan,
> even if others are not quite done.  This is good in the sense that
> later-starting backends won't be fooled into starting at what is
> guaranteed to be the most pessimal spot, but it's got a downside too,
> which is that there will be windows where seqscans are in process but
> a newly started scan won't see them.  Maybe that's a killer objection.

I don't think it would be a major objection. If there aren't other
sequential scans in progress, the point is moot, and if there are:
(a) the hint has a lower probability of being removed, since it may
contain the PID of one of those other scans.
(b) the hint is likely to be replaced quite quickly

The problem is, I think people would be more frustrated by 1 in 1000
queries starting the scan in the wrong place because a hint was deleted,
because that could cause a major difference in performance. I expect the
current patch would have more consistent performance for that reason.

To me, it seems to be a small benefit and a small cost. It's hard for me
to feel very strongly either way.

> When exactly is the hint updated?  I gathered from something Heikki said
> that it's set after processing X amount of data, but I think it might be
> better to set it *before* processing X amount of data.  That is, the
> hint means "I'm going to be scanning at least threshold blocks
> starting here", not "I have scanned threshold blocks ending here",
> which seems like the interpretation that's being used at the moment.
> What that would mean is that successive LIMIT 1000 calls would in fact
> all start at the same place, barring interference from other backends.
 

If I understand correctly, this is a one-page difference in the report
location, right? We can either report that we've just finished scanning
block 1023 (ending an X block chunk of reading) and another backend can
start scanning at 1023 (current behavior); or we could report that we're
about to scan an X block chunk of data starting with block 1024, and the
new scan can start at 1024. We don't want the new scan to jump in ahead
of the existing scan, because then we're introducing uncached blocks
between the two scans -- risking divergence. 

If the data occupies less than X data pages, the LIMIT queries will be
deterministic for single-scans anyway, because no reports will happen
(other than the starting location, which won't matter in this case).

If the data is more than that, then at least one report would have
happened. At this point, you're talking about rewinding the scan (how
far?), which I originally coded for with sync_seqscan_offset. That
feature didn't prove very useful (yet), so I removed it. 

Regards,
Jeff Davis







Re: [PATCHES] Synchronized scans

2007-06-04 Thread Tom Lane
Jeff Davis <[EMAIL PROTECTED]> writes:
> The problem is, I think people would be more frustrated by 1 in 1000
> queries starting the scan in the wrong place because a hint was deleted,

Yeah --- various people have been complaining recently about how we have
good average performance and bad worst case, and this behavior would
definitely be more of the same.  So I'm kind of backing away from the
idea of deleting the hint.  But if we could change the hint behavior to
say "start reading here", successive short LIMITed reads would all start
reading from the same point, which fixes both my reproducibility concern
and Heikki's original point about being able to re-use cached data.
You'd only get into the irreproducible behavior if the LIMIT was larger
than the amount of data scanned between hint updates.

regards, tom lane
