Re: [HACKERS] Synchronized Scan update

2007-03-14 Thread Zeugswetter Andreas ADI SD

  The advantage of sync_scan_offset is that, in some situations, a
  second scan can actually finish faster than if it were the only
  query executing, because a previous scan has already caused some
  blocks to be cached. However, 16 is a small number because that
  benefit would only be once per scan, and sync scans are only
  helpful on large tables.

Agreed.

 Alright, understood. That last part is actually something I now want
 to avoid because it's using the current cache-spoiling behaviour of
 seqscans to advantage. I'd like to remove that behaviour, but it
 sounds like we can have both
 - SeqScans that don't spoil cache
 - Synch scans
 by setting sync_scan_offset to zero.
 
I like the idea of reducing tuning parameters, but we should, at a
minimum, still allow an on/off button for sync scans. My tests
revealed that the wrong combination of OS/FS/IO-Scheduler/Controller
could result in bad I/O behavior.
   
   Agreed
   
  
  Do you have an opinion about sync_scan_threshold versus a simple 
  sync_scan_enable?
 
 enable_sync_scan?

The suggested GUCs seem closely related. IIRC, the agreed suggestion
was to use NBuffers (or a percentage thereof?) to decide whether to
spoil the buffer cache for a seq scan. It seems this same metric
should be used to decide whether to sync a scan when sync scan is
enabled. So when the table size is below NBuffers (or a percentage
thereof) you neither recycle buffers nor sync the seq scans.

Andreas



Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Jeff Davis
On Mon, 2007-03-12 at 13:21 +, Simon Riggs wrote:
 So based on those thoughts, sync_scan_offset should be fixed at 16,
 rather than being variable. In addition, ss_report_loc() should only
 report its position every 16 blocks, rather than do this every time,
 which will reduce overhead of this call.

If we fix sync_scan_offset at 16, we might as well just get rid of it.
Sync scans are only useful on large tables, and getting a free 16 pages
over a scan isn't worth the trouble. However, even without
sync_scan_offset, sync scans are still a valuable feature.

I agree that ss_report_loc() doesn't need to report on every call. If
there's any significant overhead I agree that it should report less
often. Do you think that the overhead is significant on such a simple
function?

 
 To match that, scan_recycle_buffers should be fixed at 32. So GUCs for
 sync_scan_offset and scan_recycle_buffers would not be required at all.
 
 IMHO we can also remove sync_scan_threshold and just use NBuffers
 instead. That way we get the benefit of both patches or neither, making
 it easier to understand what's going on.

I like the idea of reducing tuning parameters, but we should, at a
minimum, still allow an on/off button for sync scans. My tests revealed
that the wrong combination of OS/FS/IO-Scheduler/Controller could result
in bad I/O behavior.

 If need be, the value of scan_recycle_buffers can be varied upwards
 should the scans drift apart, as a way of bringing them back together.

If the scans aren't being brought together, that means that one of the
scans is CPU bound or outside the combined cache trail (shared_buffers
+ OS buffer cache). 

 We aren't tracking whether they are together or apart, so I would like
 to see some debug output from synch scans to allow us to assess how far
 behind the second scan is as it progresses. e.g.
 LOG:  synch scan currently on block N, trailing pathfinder by M blocks
 issued every 128 blocks as we go through the scans. 
 
 Thoughts?
 

It's hard to track where all the scans are currently. One of the
advantages of my patch is its simplicity: the scans don't need to know
about other specific scans, and there is no concept in the code of a
head scan or a pack.

There is no easy way to tell which scan is ahead and which is behind.
There was a discussion when I submitted this proposal at the beginning
of 8.3, but I didn't see enough benefit to justify all of the costs and
risks associated with scans communicating with each other. I
certainly can't implement that kind of thing before feature freeze, and
I think there's a risk of lock contention for the communication
required. I'm also concerned that -- if the scans are too
interdependent -- it would make postgres less robust against the
disappearance of a single backend (i.e. what if the backend that is
leading a scan dies?).

Regards,
Jeff Davis






Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Tom Lane
Jeff Davis [EMAIL PROTECTED] writes:
 I agree that ss_report_loc() doesn't need to report on every call. If
 there's any significant overhead I agree that it should report less
 often. Do you think that the overhead is significant on such a simple
 function?

One extra LWLock cycle per page processed definitely *is* a significant
overhead ... can you say context swap storm?  I'd think about doing it
once every 100 or so pages.

regards, tom lane



Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Jeff Davis
On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
 Jeff Davis [EMAIL PROTECTED] writes:
  I agree that ss_report_loc() doesn't need to report on every call. If
  there's any significant overhead I agree that it should report less
  often. Do you think that the overhead is significant on such a simple
  function?
 
 One extra LWLock cycle per page processed definitely *is* a significant
 overhead ... can you say context swap storm?  I'd think about doing it
 once every 100 or so pages.
 

No lock is needed to store the hint. If somehow the hint (which is
stored in a static table, no pointers) gets invalid data due to a race
condition, the new scan will simply consider the hint invalid and start
at 0.

I did this precisely to avoid causing a performance regression for usage
patterns that don't benefit from sync scans.
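
To illustrate (a sketch only, with made-up names rather than the
actual patch code): the reader copies the hint out of shared memory by
value and sanity-checks it, so a torn or raced value simply degrades
to starting at block 0.

    #include "postgres.h"           /* Oid, BlockNumber */

    #define SS_NUM_HINTS 1000       /* illustrative table size */

    typedef struct                  /* one hint per hashed relation */
    {
        Oid         relid;
        BlockNumber blockno;
    } ss_hint_t;

    static ss_hint_t *ss_hint_table;    /* lives in shared memory */

    /* Pick a starting block for a new scan of a relation with nblocks
     * pages; any implausible hint is treated as "no hint". */
    static BlockNumber
    ss_get_hint(Oid relid, BlockNumber nblocks)
    {
        ss_hint_t hint = ss_hint_table[relid % SS_NUM_HINTS];   /* copy */

        /* Raced or garbled data simply fails these checks. */
        if (hint.relid != relid || hint.blockno >= nblocks)
            return 0;               /* invalid hint: start at block 0 */

        return hint.blockno;        /* join the other scans */
    }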

Regards,
Jeff Davis




Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Simon Riggs
On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
 On Mon, 2007-03-12 at 13:21 +, Simon Riggs wrote:
  So based on those thoughts, sync_scan_offset should be fixed at 16,
  rather than being variable. In addition, ss_report_loc() should only
  report its position every 16 blocks, rather than do this every time,
  which will reduce overhead of this call.
 
 If we fix sync_scan_offset at 16, we might as well just get rid of it.
 Sync scans are only useful on large tables, and getting a free 16 pages
 over a scan isn't worth the trouble. However, even without
 sync_scan_offset, 

Not sure what you mean by a free 16 pages. Please explain?

 sync scans are still a valuable feature.

I have always thought synch scans are a valuable feature too. 

 I agree that ss_report_loc() doesn't need to report on every call. If
 there's any significant overhead I agree that it should report less
 often. Do you think that the overhead is significant on such a simple
 function?

Let's try without it and see. There's no need to access shared memory so
often.

 I like the idea of reducing tuning parameters, but we should, at a
 minimum, still allow an on/off button for sync scans. My tests revealed
 that the wrong combination of OS/FS/IO-Scheduler/Controller could result
 in bad I/O behavior.

Agreed

  If need be, the value of scan_recycle_buffers can be varied upwards
  should the scans drift apart, as a way of bringing them back together.
 
 If the scans aren't being brought together, that means that one of the
 scans is CPU bound or outside the combined cache trail (shared_buffers
 + OS buffer cache). 
 
  We aren't tracking whether they are together or apart, so I would like
  to see some debug output from synch scans to allow us to assess how far
  behind the second scan is as it progresses. e.g.
  LOG:  synch scan currently on block N, trailing pathfinder by M blocks
  issued every 128 blocks as we go through the scans. 
  
  Thoughts?
  
 
 It's hard to track where all the scans are currently. One of the
 advantages of my patch is its simplicity: the scans don't need to know
 about other specific scans, and there is no concept in the code of a
 head scan or a pack.

I'd still like to be able to trace each scan to see how far ahead/behind
it is from the other scans on the same table, however we do that.

Any backend can read the position of other backends' scans, so it should
be easy enough to put in a regular LOG entry that shows how far
ahead/behind they are from other scans. We can trace just one backend
and have it report on where it is with respect to other backends, or you
could have them all calculate their position and have just the lead scan
report the position of all other scans.

 There is no easy way to tell which scan is ahead and which is behind.
 There was a discussion when I submitted this proposal at the beginning
 of 8.3, but I didn't see enough benefit to justify all of the costs and
 risks associated with scans communicating with each other. I
 certainly can't implement that kind of thing before feature freeze, and
 I think there's a risk of lock contention for the communication
 required. I'm also concerned that -- if the scans are too
 interdependent -- it would make postgres less robust against the
 disappearance of a single backend (i.e. what if the backend that is
 leading a scan dies?).

I've not mentioned that again.

I'd like to see the trace option to allow us to tell whether it's working
as well as we'd like it to pre-release and in production. Also I want to
see whether various settings of scan_recycle_buffers help/hinder the
effectiveness of synch scans, as others have worried it might.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Jeff Davis
On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
  One extra LWLock cycle per page processed definitely *is* a significant
  overhead ... can you say context swap storm?  I'd think about doing it
  once every 100 or so pages.
  
 
 No lock is needed to store the hint. If somehow the hint (which is
 stored in a static table, no pointers) gets invalid data due to a race
 condition, the new scan will simply consider the hint invalid and start
 at 0.
 
 I did this precisely to avoid causing a performance regression for usage
 patterns that don't benefit from sync scans.
 

I'd also like to add that, if a lock were required, reporting at a
constant interval would also seem prone to a context swap storm; it
would just happen 1/100th as often. We'd need to do something to
spread the locks out over time.

That being said, I'll adjust it to report once per hundred pages anyway,
because there's really no drawback.
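
Roughly like this, reusing the illustrative ss_hint_t table sketched
earlier in the thread (again, the names and signature are illustrative,
not the actual patch code):

    /* Sketch: only write our position to shared memory every 100th
     * page instead of on every call. */
    #define SS_REPORT_INTERVAL 100

    void
    ss_report_loc(Oid relid, BlockNumber blockno)
    {
        static int pages_since_report = 0;      /* per-backend counter */

        if (++pages_since_report < SS_REPORT_INTERVAL)
            return;
        pages_since_report = 0;

        /* Unlocked store; readers validate the hint, so a race here
         * is harmless. */
        ss_hint_table[relid % SS_NUM_HINTS].relid = relid;
        ss_hint_table[relid % SS_NUM_HINTS].blockno = blockno;
    }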

Regards,
Jeff Davis




Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Simon Riggs
On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
 On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
  Jeff Davis [EMAIL PROTECTED] writes:
   I agree that ss_report_loc() doesn't need to report on every call. If
   there's any significant overhead I agree that it should report less
   often. Do you think that the overhead is significant on such a simple
   function?
  
  One extra LWLock cycle per page processed definitely *is* a significant
  overhead ... can you say context swap storm?  I'd think about doing it
  once every 100 or so pages.
  
 
 No lock is needed to store the hint. If somehow the hint (which is
 stored in a static table, no pointers) gets invalid data due to a race
 condition, the new scan will simply consider the hint invalid and start
 at 0.
 
 I did this precisely to avoid causing a performance regression for usage
 patterns that don't benefit from sync scans.

Shared memory access is still a performance/scalability concern because
so many people want access to it at the same time. 

There really is no need to do this after each block. 8 CPUs ought to be
able to do 8 scans without tripping over each other. Especially if they
are on separate tables.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Jeff Davis
On Tue, 2007-03-13 at 17:17 +, Simon Riggs wrote:
 On Tue, 2007-03-13 at 10:08 -0700, Jeff Davis wrote:
  On Tue, 2007-03-13 at 12:53 -0400, Tom Lane wrote:
   Jeff Davis [EMAIL PROTECTED] writes:
I agree that ss_report_loc() doesn't need to report on every call. If
there's any significant overhead I agree that it should report less
often. Do you think that the overhead is significant on such a simple
function?
   
   One extra LWLock cycle per page processed definitely *is* a significant
   overhead ... can you say context swap storm?  I'd think about doing it
   once every 100 or so pages.
   
  
  No lock is needed to store the hint. If somehow the hint (which is
  stored in a static table, no pointers) gets invalid data due to a race
  condition, the new scan will simply consider the hint invalid and start
  at 0.
  
  I did this precisely to avoid causing a performance regression for usage
  patterns that don't benefit from sync scans.
 
 Shared memory access is still a performance/scalability concern because
 so many people want access to it at the same time. 
 
 There really is no need to do this after each block. 8 CPUs ought to be
 able to do 8 scans without tripping over each other. Especially if they
 are on separate tables.
 

Ok, I'll do it every 100 pages.

Regards,
Jeff Davis






Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Hannu Krosing
On Tue, 2007-03-13 at 12:53, Tom Lane wrote:
 Jeff Davis [EMAIL PROTECTED] writes:
  I agree that ss_report_loc() doesn't need to report on every call. If
  there's any significant overhead I agree that it should report less
  often. Do you think that the overhead is significant on such a simple
  function?
 
 One extra LWLock cycle per page processed definitely *is* a significant
 overhead ... can you say context swap storm?  I'd think about doing it
 once every 100 or so pages.

Can't we do it in some lock-free way? Writing page numbers (4-byte
ints) into a predetermined location in shared mem should be atomic on
all platforms we support (it may still cause some cache ping-pong on
multiprocessor systems, but this should be much cheaper), and even an
occasional error in establishing the scan head should not be
catastrophic.

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Jeff Davis
On Tue, 2007-03-13 at 17:11 +, Simon Riggs wrote:
 On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
  On Mon, 2007-03-12 at 13:21 +, Simon Riggs wrote:
   So based on those thoughts, sync_scan_offset should be fixed at 16,
   rather than being variable. In addition, ss_report_loc() should only
   report its position every 16 blocks, rather than do this every time,
   which will reduce overhead of this call.
  
  If we fix sync_scan_offset at 16, we might as well just get rid of it.
  Sync scans are only useful on large tables, and getting a free 16 pages
  over a scan isn't worth the trouble. However, even without
  sync_scan_offset, 
 
 Not sure what you mean by a free 16 pages. Please explain?
 

By "free" I mean already in cache, and therefore we don't have to do
I/O to get it. I used the term loosely above, so let me re-explain:

My only point was that 16 is essentially 0 when it comes to
sync_scan_offset, because it's a small number of blocks over the course
of the scan of a large table.

If sync_scan_offset is 0, my patch will cause scans on a big table to
start where other scans are, and those scans should tend to stay
together and use newly-cached pages efficiently (and achieve the primary
goal of the patch).

The advantage of sync_scan_offset is that, in some situations, a second
scan can actually finish faster than if it were the only query
executing, because a previous scan has already caused some blocks to be
cached. However, 16 is a small number because that benefit would only be
once per scan, and sync scans are only helpful on large tables.
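
For concreteness, the offset arithmetic amounts to something like this
(a sketch; the helper and its name are illustrative, not the patch
itself):

    /* Sketch: start sync_scan_offset pages behind the reported scan
     * position, wrapping around the start of the relation if needed. */
    static BlockNumber
    ss_offset_start(BlockNumber hint, BlockNumber nblocks,
                    BlockNumber sync_scan_offset)
    {
        if (sync_scan_offset == 0 || nblocks == 0)
            return hint;    /* start exactly where the pack is */

        /* e.g. hint 5, nblocks 100, offset 16 => start at block 89 */
        return (hint + nblocks - (sync_scan_offset % nblocks)) % nblocks;
    }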

  I like the idea of reducing tuning parameters, but we should, at a
  minimum, still allow an on/off button for sync scans. My tests revealed
  that the wrong combination of OS/FS/IO-Scheduler/Controller could result
  in bad I/O behavior.
 
 Agreed
 

Do you have an opinion about sync_scan_threshold versus a simple
sync_scan_enable?

 I'd still like to be able to trace each scan to see how far ahead/behind
 it is from the other scans on the same table, however we do that.
 
 Any backend can read the position of other backends' scans, so it should

Where is that information stored? Right now my patch will overwrite the
hints of other backends, because I'm using a static data structure
(rather than one that grows). I do this to avoid the need for locking.

 be easy enough to put in a regular LOG entry that shows how far
 ahead/behind they are from other scans. We can trace just one backend
 and have it report on where it is with respect to other backends, or you
 could have them all calculate their position and have just the lead scan
 report the position of all other scans.
 

I already have each backend log its progress through the table scan
every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
currently use this information to see whether scans are staying together
or not. I think this gives us the information we need without backends
needing to communicate the information during execution.

I think I will increase the resolution of the scan progress so that we
can track every 5k or even 1k blocks read per pid per scan. That might
tell us more about the shared memory usage versus OS cache.

Is there any other information you need reported?

 I'd like to see the trace option to allow us to tell whether it's working
 as well as we'd like it to pre-release and in production. Also I want to
 see whether various settings of scan_recycle_buffers help/hinder the
 effectiveness of synch scans, as others have worried it might.
 

Can you tell me what you mean by trace option, if you mean something
different than tracking the relative positions of the scans?

I will update my patch and send it along so that we can see how they
work together. 

Regards,
Jeff Davis




Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Simon Riggs
On Tue, 2007-03-13 at 11:28 -0700, Jeff Davis wrote:
 On Tue, 2007-03-13 at 17:11 +, Simon Riggs wrote:
  On Mon, 2007-03-12 at 17:46 -0700, Jeff Davis wrote:
   On Mon, 2007-03-12 at 13:21 +, Simon Riggs wrote:
So based on those thoughts, sync_scan_offset should be fixed at 16,
rather than being variable. In addition, ss_report_loc() should only
report its position every 16 blocks, rather than do this every time,
which will reduce overhead of this call.
   
   If we fix sync_scan_offset at 16, we might as well just get rid of it.
   Sync scans are only useful on large tables, and getting a free 16 pages
   over a scan isn't worth the trouble. However, even without
   sync_scan_offset, 
  
  Not sure what you mean by a free 16 pages. Please explain?
  
 
 By "free" I mean already in cache, and therefore we don't have to do
 I/O to get it. I used the term loosely above, so let me re-explain:
 
 My only point was that 16 is essentially 0 when it comes to
 sync_scan_offset, because it's a small number of blocks over the course
 of the scan of a large table.
 
 If sync_scan_offset is 0, my patch will cause scans on a big table to
 start where other scans are, and those scans should tend to stay
 together and use newly-cached pages efficiently (and achieve the primary
 goal of the patch).

OK

 The advantage of sync_scan_offset is that, in some situations, a second
 scan can actually finish faster than if it were the only query
 executing, because a previous scan has already caused some blocks to be
 cached. However, 16 is a small number because that benefit would only be
 once per scan, and sync scans are only helpful on large tables.

Alright, understood. That last part is actually something I now want to
avoid because it's using the current cache-spoiling behaviour of
seqscans to advantage. I'd like to remove that behaviour, but it sounds
like we can have both
- SeqScans that don't spoil cache
- Synch scans
by setting sync_scan_offset to zero.

   I like the idea of reducing tuning parameters, but we should, at a
   minimum, still allow an on/off button for sync scans. My tests revealed
   that the wrong combination of OS/FS/IO-Scheduler/Controller could result
   in bad I/O behavior.
  
  Agreed
  
 
 Do you have an opinion about sync_scan_threshold versus a simple
 sync_scan_enable?

enable_sync_scan?

  I'd still like to be able to trace each scan to see how far ahead/behind
  it is from the other scans on the same table, however we do that.
  
  Any backend can read the position of other backends' scans, so it should
 
 Where is that information stored? Right now my patch will overwrite the
 hints of other backends, because I'm using a static data structure
 (rather than one that grows). I do this to avoid the need for locking.

OK, well, we can still read it before we overwrite it to calc the
difference. That will at least allow us to get a difference between
points as we go along. That seems like it's worth having, even if it
isn't accurate for 3+ concurrent scans.
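
Something like this inside the report function, say (a sketch only,
using the illustrative hint-table names from earlier in the thread,
and assuming relid and blockno are in scope there):

    /* Sketch: before overwriting the hint, read the previous value
     * and log how far we are from whoever wrote it. */
    ss_hint_t prev = ss_hint_table[relid % SS_NUM_HINTS];

    if (prev.relid == relid)
        elog(DEBUG2, "sync scan: at block %u, previous report was %u",
             blockno, prev.blockno);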

  be easy enough to put in a regular LOG entry that shows how far
  ahead/behind they are from other scans. We can trace just one backend
  and have it report on where it is with respect to other backends, or you
  could have them all calculate their position and have just the lead scan
  report the position of all other scans.
  
 
 I already have each backend log its progress through the table scan
 every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
 currently use this information to see whether scans are staying together
 or not. I think this gives us the information we need without backends
 needing to communicate the information during execution.

Well, that is good, thank you for adding that after initial discussions.

Does it have the time at which a particular numbered block is reached?
(i.e. Block #117 is not the same thing as the 117th block scanned). We
can use that to compare the time difference of each scan.

 I think I will increase the resolution of the scan progress so that we
 can track every 5k or even 1k blocks read per pid per scan. That might
 tell us more about the shared memory usage versus OS cache.
 
 Is there any other information you need reported?

Not sure yet! I just want to look one level deeper, to see if everything
is working like we think it should.

  I'd like to see the trace option to allow us to tell whether it's working
  as well as we'd like it to pre-release and in production. Also I want to
  see whether various settings of scan_recycle_buffers help/hinder the
  effectiveness of synch scans, as others have worried it might.
  
 
 Can you tell me what you mean by trace option, if you mean something
 different than tracking the relative positions of the scans?
 
 I will update my patch and send it along so that we can see how they
 work together. 

Great

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com




Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Jeff Davis
On Tue, 2007-03-13 at 19:24 +, Simon Riggs wrote:
  The advantage of sync_scan_offset is that, in some situations, a second
  scan can actually finish faster than if it were the only query
  executing, because a previous scan has already caused some blocks to be
  cached. However, 16 is a small number because that benefit would only be
  once per scan, and sync scans are only helpful on large tables.
 
 Alright, understood. That last part is actually something I now want to
 avoid because it's using the current cache-spoiling behaviour of
 seqscans to advantage. I'd like to remove that behaviour, but it sounds
 like we can have both
 - SeqScans that don't spoil cache
 - Synch scans
 by setting sync_scan_offset to zero.
 

Precisely.

If there is a cache-spoiling effect of the OS buffer cache that we want
to take advantage of, we could still set it to a non-zero value. But the
utility of sync_scan_offset does decrease with your patch, so removing
it altogether is a possibility (hopefully the numbers will tell us what
to do).


  Do you have an opinion about sync_scan_threshold versus a simple
  sync_scan_enable?
 
 enable_sync_scan?
 

After looking at other GUC names, I suggest that it's either
sync_scan (for on/off) or sync_scan_threshold (if we do want to
allow a numerical threshold). All the GUCs beginning with enable_ are
planner settings.

If we only allow on/off, we could probably just sync scan every table
because of your recycle_buffers patch.
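
For what it's worth, the on/off form would just be one more boolean
entry in guc.c, along these lines (a sketch of the table format from
memory; the name, group, and description are placeholders):

    /* Sketch of a guc.c ConfigureNamesBool[] entry; "sync_scan" would
     * be declared as a global bool elsewhere. */
    {
        {"sync_scan", PGC_USERSET, QUERY_TUNING_OTHER,
            gettext_noop("Enables synchronized sequential scans."),
            NULL
        },
        &sync_scan,
        true, NULL, NULL
    },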

   I'd still like to be able to trace each scan to see how far ahead/behind
   it is from the other scans on the same table, however we do that.
   
   Any backend can read the position of other backends' scans, so it should
  
  Where is that information stored? Right now my patch will overwrite the
  hints of other backends, because I'm using a static data structure
  (rather than one that grows). I do this to avoid the need for locking.
 
 OK, well, we can still read it before we overwrite it to calc the
 difference. That will at least allow us to get a difference between
 points as we go along. That seems like it's worth having, even if it
 isn't accurate for 3+ concurrent scans.

Let me know if the things I list below don't cover the information
you're looking for here. It would be easy for me to emit a log message
at the time it's overwriting the hint, but that would be a lot of noise:
every time ss_report_loc() is called, which we discussed would be once
per 100 pages read per scan. 

   be easy enough to put in a regular LOG entry that shows how far
   ahead/behind they are from other scans. We can trace just one backend
   and have it report on where it is with respect to other backends, or you
   could have them all calculate their position and have just the lead scan
   report the position of all other scans.
   
  
  I already have each backend log its progress through the table scan
  every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
  currently use this information to see whether scans are staying together
  or not. I think this gives us the information we need without backends
  needing to communicate the information during execution.
 
 Well, that is good, thank you for adding that after initial discussions.
 
 Does it have the time at which a particular numbered block is reached?
 (i.e. Block #117 is not the same thing as the 117th block scanned). We
 can use that to compare the time difference of each scan.

Right now it logs when a scan starts, what block number of the table it
starts on, and also prints out the current block it's scanning every N
blocks (100k or 10k depending on debug level). The time and the pid are,
of course, available from log_line_prefix.

I'll add the table OID to each log message in case we test, for example,
a single backend scanning multiple tables at once.

I'll also clean it up a bit, so that the information is a little easier
to grep out of the postgres logfiles and easier to analyze/graph.

Regards,
Jeff Davis




Re: [HACKERS] Synchronized Scan update

2007-03-13 Thread Simon Riggs
On Tue, 2007-03-13 at 13:39 -0700, Jeff Davis wrote:

   Do you have an opinion about sync_scan_threshold versus a simple
   sync_scan_enable?
  
  enable_sync_scan?
  
 
 After looking at other GUC names, I suggest that it's either
 sync_scan (for on/off) or sync_scan_threshold (if we do want to
 allow a numerical threshold). All the GUCs beginning with enable_ are
 planner settings.

How about: sync_seqscans

so the phrase matches the equivalent enable_ parameter.

 If we only allow on/off, we could probably just sync scan every table
 because of your recycle_buffers patch.

The buffer recycling only makes sense for large scans, so there's an
exact match for when both techniques need to kick in.

I think I'd just lose this parameter and have it kick in at either
NBuffers or NBuffers/2. We don't need another parameter...

I'm not planning to have scan_recycle_buffers continue into the
production version.

I'd still like to be able to trace each scan to see how far ahead/behind
it is from the other scans on the same table, however we do that.

Any backend can read the position of other backends' scans, so it should
   
   Where is that information stored? Right now my patch will overwrite the
   hints of other backends, because I'm using a static data structure
   (rather than one that grows). I do this to avoid the need for locking.
  
  OK, well, we can still read it before we overwrite it to calc the
  difference. That will at least allow us to get a difference between
  points as we go along. That seems like it's worth having, even if it
  isn't accurate for 3+ concurrent scans.
 
 Let me know if the things I list below don't cover what the information
 you're looking for here. It would be easy for me to emit a log message
 at the time it's overwriting the hint, but that would be a lot of noise:
 every time ss_report_loc() is called, which we discussed would be once
 per 100 pages read per scan. 
 
be easy enough to put in a regular LOG entry that shows how far
ahead/behind they are from other scans. We can trace just one backend
and have it report on where it is with respect to other backends, or you
could have them all calculate their position and have just the lead scan
report the position of all other scans.

   
   I already have each backend log its progress through the table scan
   every 100k blocks to DEBUG (higher DEBUG gives every 10k blocks). I
   currently use this information to see whether scans are staying together
   or not. I think this gives us the information we need without backends
   needing to communicate the information during execution.
  
  Well, that is good, thank you for adding that after initial discussions.
  
  Does it have the time at which a particular numbered block is reached?
  (i.e. Block #117 is not the same thing as the 117th block scanned). We
  can use that to compare the time difference of each scan.
 
 Right now it logs when a scan starts, what block number of the table it
 starts on, and also prints out the current block it's scanning every N
 blocks (100k or 10k depending on debug level). The time and the pid are,
 of course, available from log_line_prefix.

Can you make it log every block whose id is divisible by 100k or 10k?

Otherwise one scan will log blocks 100,000... 200,000... etc. and the
next scan will log 17,357... 117,357... etc., which will be much harder
to work out. That will give us lap times for every 100,000 blocks.
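
i.e. key the log message to the absolute block number, roughly (a
sketch only; the function and message text are illustrative):

    #include "postgres.h"
    #include "utils/rel.h"       /* Relation, RelationGetRelid */

    /* Sketch: log at absolute block ids so that concurrent scans
     * report at the same points, giving comparable "lap times". */
    static void
    ss_log_position(Relation rel, BlockNumber blocknum)
    {
        if (blocknum % 100000 == 0)
            elog(DEBUG1, "sync scan: relation %u at block %u",
                 RelationGetRelid(rel), blocknum);
    }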

I'm particularly interested in the turning point where the scan starts
again at the beginning of the file. It would be good to know which
block id it turned at and when that was. We may get out of step at that
point. Maybe. We'll find out.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Synchronized Scan update

2007-03-12 Thread Simon Riggs
On Fri, 2007-03-02 at 15:03 -0800, Jeff Davis wrote:
 Is there any consensus about whether to include these two parameters as
 GUCs or constants if my patch is to be accepted?
 
 (1) sync_scan_threshold: Use synchronized scanning for tables greater
 than this many pages; smaller tables will not be affected.
 (2) sync_scan_offset: Start a new scan this many pages before a
 currently running scan to take advantage of the pages
  that are likely already in cache.
 
 Right now they are just constants defined in a header, but a GUC might
 make sense. I'd like to know which version is more acceptable when I
 submit my final patch.

I'm looking at ways of helping Synch Scan and scan_recycle_buffers to
work well together. I've had a look through the synch scan patch in more
detail to help this along.

ISTM they can play nicely together.

Results from tests show that with a single scan we get better
performance with scan_recycle_buffers = ~32. That gives a +25%
performance bonus. Synch scan has the capability to reduce I/O by 50%
across two concurrent scans, so that would be likely to outweigh the
gain from L2 cache reduction, unless we can get both together.

It's possible that multiple scans would use the same CPU, or at least
use multiple cores on the same chip and hence same L2 cache. That's more
likely if the synch scan works well, since the second scan can process
the buffer while the lead process performs I/O. So it's likely that on
dual- and quad-core CPUs we'll be able to get both benefits in most
cases.

So based on those thoughts, sync_scan_offset should be fixed at 16,
rather than being variable. In addition, ss_report_loc() should only
report its position every 16 blocks, rather than do this every time,
which will reduce overhead of this call.

To match that, scan_recycle_buffers should be fixed at 32. So GUCs for
sync_scan_offset and scan_recycle_buffers would not be required at all.

IMHO we can also remove sync_scan_threshold and just use NBuffers
instead. That way we get the benefit of both patches or neither, making
it easier to understand what's going on.

On my patch, I'll make buffer recycling only work for SeqScans and
VACUUMs.

If need be, the value of scan_recycle_buffers can be varied upwards
should the scans drift apart, as a way of bringing them back together.
We aren't tracking whether they are together or apart, so I would like
to see some debug output from synch scans to allow us to assess how far
behind the second scan is as it progresses. e.g.
LOG:  synch scan currently on block N, trailing pathfinder by M blocks
issued every 128 blocks as we go through the scans. 

Thoughts?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Synchronized Scan update

2007-03-12 Thread Luke Lonergan
Simon,

On 3/12/07 6:21 AM, Simon Riggs [EMAIL PROTECTED] wrote:

 So based on those thoughts, sync_scan_offset should be fixed at 16,
 rather than being variable. In addition, ss_report_loc() should only
 report its position every 16 blocks, rather than do this every time,
 which will reduce overhead of this call.

And for N concurrent scans?

I think there is actually no need to synchronize the shared buffers at all
for synchronized scans.  The OS I/O cache will do that for us and we're just
going to interfere and pessimize by trying to divide up a local buffer.

I suggest that this be proven or disproved by running this test: measure
the performance of syncscan with the non-polluting buffer change, then
measure Jeff's patch plus the non-polluting change with multiple scans,
then measure with your suggested changes to synchronize the buffers.

Somewhere in that progression we'll learn more about how the multi-level
buffering really works.  I think we'll get all the shared I/O cache we need.

- Luke





Re: [HACKERS] Synchronized Scan update

2007-03-12 Thread Simon Riggs
On Mon, 2007-03-12 at 08:42 -0700, Luke Lonergan wrote:

 On 3/12/07 6:21 AM, Simon Riggs [EMAIL PROTECTED] wrote:
 
  So based on those thoughts, sync_scan_offset should be fixed at 16,
  rather than being variable. In addition, ss_report_loc() should only
  report its position every 16 blocks, rather than do this every time,
  which will reduce overhead of this call.
 
 And for N concurrent scans?
 
 I think there is actually no need to synchronize the shared buffers at all
 for synchronized scans.  The OS I/O cache will do that for us and we're just
 going to interfere and pessimize by trying to divide up a local buffer.

I think you've misunderstood my comment slightly.

In Jeff's patch, ss_report_loc() is called after every block is read,
AFAICS. I was suggesting that we don't do it that frequently, to reduce
the overhead of reporting the location. 

That has nothing to do with re-synchronising the two scans mid-way.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Synchronized Scan update

2007-03-07 Thread Jim Nasby

On Mar 6, 2007, at 9:43 AM, Josh Berkus wrote:

Don't get me wrong, I want things to be easily understandable as well,
but the reason you cite above pretty much makes us need to remove most
of the postgresql.conf, including all bgwriter, vacuum cost delay, and
autovac settings. Not to mention commit delay and others ;).


Wouldn't that be nice!

The explosion of GUC settings is primarily a result of not enough
information. The reason there are 7 bgwriter settings, for example, is
that we have no idea what those settings should be and are hoping that
people will tinker with them and tell us. Someday when I can fully
profile bgwriter, we'll just have one setting: bgwriter_aggressive, set
to a number between 0 and 9.


In the meantime, it would be great for these multiple-settings cases
to be listed somewhere, indicating that it's something we could use
help with. I think that with some explanation of what we're looking
for, there are any number of people who could do this kind of profiling.

--
Jim Nasby [EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Synchronized Scan update

2007-03-06 Thread Josh Berkus
JD,

 Don't get me wrong, I want things to be easily understandable as well,
 but the reason you cite above pretty much makes us need to remove most
 of the postgresql.conf, including all bgwriter, vacuum cost delay, and
 autovac settings. Not to mention commit delay and others ;).

Wouldn't that be nice!

The explosion of GUC settings is primarily a result of not enough information.  
The reason there are 7 bgwriter settings, for example, is that we have no 
idea what those settings should be and are hoping that people will tinker 
with them and tell us.  Someday when I can fully profile bgwriter, we'll just 
have one setting: bgwriter_aggressive, set to a number between 0 and 9.  

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Synchronized Scan update

2007-03-05 Thread Jeff Davis
On Sun, 2007-03-04 at 11:54 +, Simon Riggs wrote:
  (2) sync_scan_offset: Start a new scan this many pages before a
  currently running scan to take advantage of the pages
   that are likely already in cache.
 
 I'm somewhat dubious about this parameter, I have to say, even though I
 am eager for this feature. It seems like a magic parameter that works
 only when we have the right knowledge to set it correctly.
 

That was my concern about this parameter also.

 How will we know what to default it to and how will we know whether to
 set it higher or lower for better performance? Does that value vary
 according to the workload on the system? How?
 

Perhaps people would only set this parameter when they know it will
help, and for more complex (or varied) usage patterns they'd set
sync_scan_offset to 0 to be safe.

My thinking on the subject (and this is only backed up by very basic
tests) is that there are basically two situations where setting this
parameter too high can hurt:
(1) It's too close to the limits of your physical memory, and you end up
diverging the scans when they could be kept together.
(2) You're using a lot of CPU and the backends aren't processing the
buffers as fast as your I/O system is delivering them. This will prevent
the scans from converging.

If your CPUs are well below capacity and you choose a size significantly
less than your effective cache size, I don't think it will hurt.

 I'm worried that we get a feature that works well on simple tests and
 not at all in real world circumstances. I don't want to cast doubt on
 what could be a great patch or be negative: I just see that the feature
 relies on the dynamic behaviour of the system. I'd like to see some
 further studies on how this works to make sure that we realistically
 know how to set this knob, that it's the correct knob, and that it's
 the only one we need.

I will do some better tests on some better hardware this week and next
week. I hope that sheds some light.

 Further thoughts: It sounds like sync_scan_offset is related to
 effective_cache_size. Can you comment on whether that might be
 something we can use as well/instead? (i.e. set the scan offset to say
 K * effective_cache_size, 0.1 <= K <= 0.5)???
 
 Might we do roughly the same thing with sync_scan_threshold as well, and
 just have enable_sync_scan instead? i.e. sync_scan_threshold =
 effective_cache_size? When would those two parameters not be connected
 directly to each other?
 

Originally, these parameters were in terms of the effective_cache_size.
Somebody else convinced me that it was too confusing to have the
variables dependent on each other, so I made them independent. I don't
have a strong opinion either way.

Regards,
Jeff Davis







Re: [HACKERS] Synchronized Scan update

2007-03-04 Thread Simon Riggs
On Fri, 2007-03-02 at 15:03 -0800, Jeff Davis wrote:
 Is there any consensus about whether to include these two parameters as
 GUCs or constants if my patch is to be accepted?
 
 (1) sync_scan_threshold: Use synchronized scanning for tables greater
 than this many pages; smaller tables will not be affected.

That sounds OK.

 (2) sync_scan_offset: Start a new scan this many pages before a
 currently running scan to take advantage of the pages
  that are likely already in cache.

I'm somewhat dubious about this parameter, I have to say, even though I
am eager for this feature. It seems like a magic parameter that works
only when we have the right knowledge to set it correctly.

How will we know what to default it to and how will we know whether to
set it higher or lower for better performance? Does that value vary
according to the workload on the system? How?

I'm worried that we get a feature that works well on simple tests and
not at all in real world circumstances. I don't want to cast doubt on
what could be a great patch or be negative: I just see that the feature
relies on the dynamic behaviour of the system. I'd like to see some
further studies on how this works to make sure that we realistically
know how to set this knob, that it's the correct knob, and that it's
the only one we need.

Further thoughts: It sounds like sync_scan_offset is related to
effective_cache_size. Can you comment on whether that might be
something we can use as well/instead? (i.e. set the scan offset to say
K * effective_cache_size, 0.1 <= K <= 0.5)???

Might we do roughly the same thing with sync_scan_threshold as well, and
just have enable_sync_scan instead? i.e. sync_scan_threshold =
effective_cache_size? When would those two parameters not be connected
directly to each other?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Synchronized Scan update

2007-03-04 Thread Joshua D. Drake



(2) sync_scan_offset: Start a new scan this many pages before a
currently running scan to take advantage of the pages
 that are likely already in cache.



I'm somewhat dubious about this parameter, I have to say, even though I
am eager for this feature. It seems like a magic parameter that works
only when we have the right knowledge to set it correctly.

  

Hello,

Don't get me wrong, I want things to be easily understandable as well,
but the reason you cite above pretty much makes us need to remove most
of the postgresql.conf, including all bgwriter, vacuum cost delay, and
autovac settings.

Not to mention commit delay and others ;).

Sincerely,

Joshua D. Drake









Re: [HACKERS] Synchronized Scan update

2007-03-02 Thread Jeff Davis

Is there any consensus about whether to include these two parameters as
GUCs or constants if my patch is to be accepted?

(1) sync_scan_threshold: Use synchronized scanning for tables greater
than this many pages; smaller tables will not be affected.
(2) sync_scan_offset: Start a new scan this many pages before a
currently running scan to take advantage of the pages
 that are likely already in cache.

Right now they are just constants defined in a header, but a GUC might
make sense. I'd like to know which version is more acceptable when I
submit my final patch.
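
(For concreteness, the header-constant form is roughly the following,
though the macro names and values shown here are only illustrative:)

    /* Sketch of the current header-constant form. */
    #define SYNC_SCAN_THRESHOLD  131072    /* pages: ~1 GB at 8 kB/page */
    #define SYNC_SCAN_OFFSET     16        /* start this many pages back */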

Regards,
Jeff Davis




Re: [HACKERS] Synchronized Scan update

2007-03-02 Thread Josh Berkus
Jeff,

 Right now they are just constants defined in a header, but a GUC might
 make sense. I'd like to know which version is more acceptable when I
 submit my final patch.

As much as I hate the thought of more GUCs, until we have a solid 
performance profile for synch scan we probably need them.  You should 
include the option to turn synch_scan off, such as by setting 
synch_scan_threshold to -1.

Oh, and remember that these now need to be able to take kB/MB/GB.

These options should probably go in postgresql.conf under QUERY TUNING, 
with their own sub-head.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Synchronized Scan update

2007-03-02 Thread Jeff Davis
On Fri, 2007-03-02 at 15:49 -0800, Josh Berkus wrote:
 Jeff,
 
  Right now they are just constants defined in a header, but a GUC might
  make sense. I'd like to know which version is more acceptable when I
  submit my final patch.
 
 As much as I hate the thought of more GUCs, until we have a solid 
 performance profile for synch scan we probably need them.  You should 

I will include them in the final patch then. 

 include the option to turn synch_scan off, such as by setting 
 synch_scan_threshold to -1.

Naturally.

 Oh, and remember that these now need to be able to take K/MB/GB.

Will do.

 These options should probably go in postgresql.conf under QUERY TUNING, 
 with their own sub-head.

That makes sense to me.

Regards,
Jeff Davis

PS: Did you happen to get my patch for testing (sent off-list)? If
testing will take a while, that's OK, I'd just like to know whether to
expect the results before feature freeze.




Re: [HACKERS] Synchronized Scan update

2007-03-02 Thread Josh Berkus
Jeff,

 PS: Did you happen to get my patch for testing (sent off-list)? If
 testing will take a while, that's OK, I'd just like to know whether to
 expect the results before feature freeze.

I'm not sure.  We have a bunch of patches in our queue to test, and the
benchmark guys don't really expect synch scan to affect OLTP benchmarks
much.  You might want to pester Greenplum about testing on TPC-H.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



[HACKERS] Synchronized Scan update

2007-02-27 Thread Jeff Davis

I have found some interesting results from my tests with the
Synchronized Scan patch I'm working on.

The two benefits that I hope to achieve with the patch are:
(1) Better caching behavior with multiple sequential scans running in
parallel
(2) Faster sequential reads from disk and less seeking

I have consistently found #1 to be true. There is still more testing to
be done (hopefully soon), but I haven't found a problem yet. And the
benefits I've seen are very substantial, which isn't hard, since in the
typical case, a large sequential scan will have 0% cache hit rate. These
numbers were retrieved using log_executor_stats=on.

#2 however, is a little trickier. IIRC, Tom was the first to point out
that the I/O system might not recognize that reads coming from different
processes are indeed one sequential read.

At first I never saw the problem actually happen, and I assumed that the
OS was being smart enough. However, recently I noticed this problem on
my home machine, which experienced great caching behavior but poor I/O
throughput (as measured by iostat). My home machine was using the Linux
CFQ io scheduler, and when I swapped the CFQ io scheduler for the
anticipatory scheduler (AS), it worked great. When I sent Josh my patch
(per his request) I mentioned the problem I experienced.

Then I started investigating, and found some mixed results. My test was
basically to use iostat (or zpool iostat) to measure disk throughput,
and N processes of dd if=bigfile of=/dev/null (started simultaneously)
to run the test. I consider the test to be passed if the additional
processes did not interfere (i.e. each process finished as though it
were the only one running). Of course, all tests were I/O bound.

My home machine (core 2 duo, single SATA disk, intel controller):
Linux/ext3/AS: passed
Linux/ext3/CFQ: failed
Linux/ext3/noop: passed
Linux/ext3/deadline: passed

Machine 2 (old thinkpad, IDE disk):
Solaris/UFS: failed
Solaris/ZFS: passed

Machine 3 (dell 2950, LSI PERC/5i controller, 6 SAS disks, RAID-10,
adaptive read ahead):
FreeBSD/UFS: failed

(I suspect the last test would be fine with read ahead always on, and it
may just be a problem with the adaptive read ahead feature)

There are a lot of factors involved, because several components of the
I/O system have the ability to reorder requests or read ahead, such as
the block layer and the controller. 

The block request ordering isn't the only factor because Solaris/UFS
only orders the requests by cylinder and moves only in one direction
(i.e. looks like a simple elevator algorithm that isn't affected by
process id). At least, that's how I understand it.

Readahead can't be the only factor either because replacing the io
scheduler in Linux solved the problem, even when that replacement was
the noop scheduler.

Anyway, back to the patch, it looks like there are some complications if
you try to use it with the wrong combination of fs, io scheduler, and
controller. 

The patch is designed for certain query patterns anyway, so I don't
think that this is a show-stopper. Given the better cache behavior, it
seems like it's really the job of the I/O system to get a single,
sequential stream of blocks efficiently.

The alternative would be to have a single block-reader process, which I
don't think we want to do. However, I/O systems don't really seem to
know how to handle multiple processes reading from the same file very
well.

Comments?

Regards,
Jeff Davis

