subject:"Re\: \[PERFORM\] I\/O on select count\(\*\)"

Re: [PERFORM] I/O on select count(*)

2008-05-24 Thread Decibel!


On May 18, 2008, at 1:28 AM, Greg Smith wrote:
I just collected all the good internals information included in  
this thread and popped it onto http://wiki.postgresql.org/wiki/ 
Hint_Bits where I'll continue to hack away at the text until it's  
readable.  Thanks to everyone who answered my questions here,  
that's good progress toward clearing up a very underdocumented area.


I note a couple of potential TODO items not on the official list  
yet that came up during this discussion:


-Smooth latency spikes when switching commit log pages by  
preallocating cleared pages before they are needed


-Improve bulk loading by setting frozen hint bits for tuple  
inserts which occur within the same database transaction as the  
creation of the table into which they're being inserted


Did I miss anything?  I think everything brought up falls either  
into one of those two or the existing Consider having the  
background writer update the transaction status hint bits... TODO.


-Evaluate impact of improved caching of CLOG per Greenplum:

Per Luke Longergan:
I'll find out if we can extract our code that did the work. It was  
simple but scattered in a few routines. In concept it worked like this:


1 - Ignore if hint bits are unset, use them if set.  This affects  
heapam and vacuum I think.
2 - implement a cache for clog lookups based on the optimistic  
assumption that the data was inserted in bulk.  Put the cache one  
call away from heapgetnext()


I forget the details of (2).  As I recall, if we fall off of the  
assumption, the penalty for long scans get large-ish (maybe 2X), but  
since when do people full table scan when they're updates/inserts are  
so scattered across TIDs?  It's an obvious big win for DW work.


--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828




smime.p7s
Description: S/MIME cryptographic signature

Re: [PERFORM] I/O on select count(*)

2008-05-24 Thread Decibel!


On May 18, 2008, at 1:28 AM, Greg Smith wrote:
I just collected all the good internals information included in  
this thread and popped it onto http://wiki.postgresql.org/wiki/ 
Hint_Bits where I'll continue to hack away at the text until it's  
readable.  Thanks to everyone who answered my questions here,  
that's good progress toward clearing up a very underdocumented area.


I note a couple of potential TODO items not on the official list  
yet that came up during this discussion:


-Smooth latency spikes when switching commit log pages by  
preallocating cleared pages before they are needed


-Improve bulk loading by setting frozen hint bits for tuple  
inserts which occur within the same database transaction as the  
creation of the table into which they're being inserted


Did I miss anything?  I think everything brought up falls either  
into one of those two or the existing Consider having the  
background writer update the transaction status hint bits... TODO.


Blah, sorry for the double-post, but I just remembered a few things...

Did we completely kill the idea of the bg_writer *or some other  
background process* being responsible for setting all hint-bits on  
dirty pages before they're written out?


Also, Simon and Tom had an idea at PGCon: Don't set hint-bits in the  
back-end if the page isn't already dirty. We'd likely need some  
heuristics on this... based on Luke's comments about improved CLOG  
caching maybe we want to set the bits anyway if the tuples without  
them set are from old transactions (idea being that pulling those  
CLOG pages would be pretty expensive). Or better yet; if we have to  
read a CLOG page off disk, set the bits.


This could still potentially be a big disadvantage for data  
warehouses; though perhaps the way to fix that is recommend a  
backgrounded vacuum after data load.

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828




smime.p7s
Description: S/MIME cryptographic signature

Re: [PERFORM] I/O on select count(*)

2008-05-22 Thread Luke Lonergan

Hi Hannu,

Interesting suggestion on the partial index!

I'll find out if we can extract our code that did the work.  It was simple but 
scattered in a few routines.

In concept it worked like this:

1 - Ignore if hint bits are unset, use them if set.  This affects heapam and 
vacuum I think.
2 - implement a cache for clog lookups based on the optimistic assumption that 
the data was inserted in bulk.  Put the cache one call away from heapgetnext()

I forget the details of (2).  As I recall, if we fall off of the assumption, 
the penalty for long scans get large-ish (maybe 2X), but since when do people 
full table scan when they're updates/inserts are so scattered across TIDs?  
It's an obvious big win for DW work.

We also have a GUC to turn it off if needed, in which case a vacuum will write 
the hint bits.

- Luke

- Original Message -
From: Hannu Krosing [EMAIL PROTECTED]
To: Luke Lonergan
Cc: Pavan Deolasee [EMAIL PROTECTED]; Greg Smith [EMAIL PROTECTED]; Alvaro 
Herrera [EMAIL PROTECTED]; pgsql-performance@postgresql.org 
pgsql-performance@postgresql.org
Sent: Thu May 22 12:10:02 2008
Subject: Re: [PERFORM] I/O on select count(*)

On Thu, 2008-05-15 at 10:52 +0800, Luke Lonergan wrote:
 BTW – we’ve removed HINT bit checking in Greenplum DB and improved the
 visibility caching which was enough to provide performance at the same
 level as with the HINT bit optimization, but avoids this whole “write
 the data, write it to the log also, then write it again just for good
 measure” behavior.
 
 For people doing data warehousing work like the poster, this Postgres
 behavior is miserable.  It should be fixed for 8.4 for sure
 (volunteers?)

I might try it. I think I have told you about my ideas ;)
I plan to first do cacheing (for being able to doi index only scans
among other things) and then if the cache works reliably, use the
cacheing code as the main visibility / MVCC mechanism.

Is Greenplums code available, or should I roll my own ?

 BTW – for the poster’s benefit, you should implement partitioning by
 date, then load each partition and VACUUM ANALYZE after each load.
  You probably won’t need the date index anymore – so your load times
 will vastly improve (no indexes), you’ll store less data (no indexes)
 and you’ll be able to do simpler data management with the partitions.
 
 You may also want to partition AND index if you do a lot of short
 range selective date predicates.  Example would be: partition by day,
 index on date field, queries selective on date ranges by hour will
 then select out only the day needed, then index scan to get the 
 hourly values.

If your queries allow it, you may try indexing on 
int2::extract('HOUR' from date)
so the index may be smaller

storing the date as type abstime is another way to reduce index size.

 Typically time-oriented data is nearly time sorted anyway, so you’ll
 also get the benefit of a clustered index.


Hannu

Re: [PERFORM] I/O on select count(*)

2008-05-19 Thread Matthew Wakeling


Alvaro Herrera [EMAIL PROTECTED] writes:

pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
zeroed, which is the bit pattern for transaction in progress.  So when
a transaction starts, it only needs to ensure that the pg_clog page that
corresponds to it is allocated, but it need not write anything to it.


Yeah, that's pretty-much how I imagined it would be. Nice and compact. I 
would imagine that if there are only a few transactions, doing the pg_clog 
lookup would be remarkably quick. However, once there have been a 
bazillion transactions, with tuples pointing to the whole range of them, 
it would degenerate into having to perform an extra seek for each tuple, 
and that's why you added the hint bits.


On Fri, 16 May 2008, Tom Lane wrote:

One additional point: this means that one transaction in every 32K
writing transactions *does* have to do extra work when it assigns itself
an XID, namely create and zero out the next page of pg_clog.  And that
doesn't just slow down the transaction in question, but the next few
guys that would like an XID but arrive on the scene while the
zeroing-out is still in progress.

This probably contributes to the behavior that Simon and Josh regularly
complain about, that our transaction execution time is subject to
unpredictable spikes.  I'm not sure how to get rid of it though.


Does it really take that long to zero out 8kB of RAM? I thought CPUs were 
really quick at doing that!


Anyway, the main thing you need to avoid is all the rest of the 
transactions waiting for the new pg_clog page. The trick is to generate 
the new page early, outside any locks on existing pages. It doesn't 
necessarily need to be done by a daemon thread at all.


Matthew

--
I'm NOT paranoid!  Which of my enemies told you this?

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-19 Thread Bruce Momjian

Matthew Wakeling wrote:
 On Fri, 16 May 2008, Tom Lane wrote:
  One additional point: this means that one transaction in every 32K
  writing transactions *does* have to do extra work when it assigns itself
  an XID, namely create and zero out the next page of pg_clog.  And that
  doesn't just slow down the transaction in question, but the next few
  guys that would like an XID but arrive on the scene while the
  zeroing-out is still in progress.
 
  This probably contributes to the behavior that Simon and Josh regularly
  complain about, that our transaction execution time is subject to
  unpredictable spikes.  I'm not sure how to get rid of it though.
 
 Does it really take that long to zero out 8kB of RAM? I thought CPUs were 
 really quick at doing that!

Yea, that was my assumption too.

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-19 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Matthew Wakeling wrote:
 Does it really take that long to zero out 8kB of RAM? I thought CPUs were 
 really quick at doing that!

 Yea, that was my assumption too.

You have to write the page (to be sure there is space for it on disk)
not only zero it.

This design is kind of a holdover, though, from back when we had one
ever-growing clog file.  Today I'd be inclined to think about managing
it more like pg_xlog, ie, have some background process pre-create a
whole segment file at a time.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-19 Thread Greg Smith


On Mon, 19 May 2008, Matthew Wakeling wrote:

Does it really take that long to zero out 8kB of RAM? I thought CPUs were 
really quick at doing that!


You don't get the whole CPU--you get time slices of one.  Some of the 
cases complaints have come in about have over a thousand connections all 
fighting for CPU time, and making every one of them block for one guy who 
needs to fiddle with memory for a while can be a problem.  If you're 
unlucky you won't even be on the same CPU you started on each time you get 
a little check of time, and you'll run at the speed of RAM rather than 
that of the CPU--again, fighting for RAM access with every other process 
on the server.


The real question in my mind is why this turns into a bottleneck before 
the similar task of cleaning the 16MB XLOG segment does.  I expected that 
one would need to be cracked before the CLOG switch time could possibly be 
an issue, but reports from the field seem to suggest otherwise.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-19 Thread Tom Lane

Greg Smith [EMAIL PROTECTED] writes:
 The real question in my mind is why this turns into a bottleneck before 
 the similar task of cleaning the 16MB XLOG segment does.

Because we do the latter off-line, or at least try to.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-17 Thread Greg Smith

I just collected all the good internals information included in this 
thread and popped it onto http://wiki.postgresql.org/wiki/Hint_Bits where 
I'll continue to hack away at the text until it's readable.  Thanks to 
everyone who answered my questions here, that's good progress toward 
clearing up a very underdocumented area.


I note a couple of potential TODO items not on the official list yet that 
came up during this discussion:


-Smooth latency spikes when switching commit log pages by preallocating 
cleared pages before they are needed


-Improve bulk loading by setting frozen hint bits for tuple inserts 
which occur within the same database transaction as the creation of the 
table into which they're being inserted


Did I miss anything?  I think everything brought up falls either into one 
of those two or the existing Consider having the background writer update 
the transaction status hint bits... TODO.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-16 Thread Greg Smith


On Thu, 15 May 2008, Alvaro Herrera wrote:


Starting a transaction does not write anything to pg_clog.


For Matt and others, some details here are in 
src/backend/access/transam/README:


pg_clog records the commit status for each transaction that has been 
assigned an XID.


Transactions and subtransactions are assigned permanent XIDs only when/if 
they first do something that requires one --- typically, 
insert/update/delete a tuple, though there are a few other places that 
need an XID assigned.


After reading the code and that documentation a bit, the part I'm still 
not sure about is whether the CLOG entry is created when the XID is 
assigned and then kept current as the state changes, or whether that isn't 
even in CLOG until the transaction is committed.  It seems like the 
latter, but there's some ambiguity in the wording and too many code paths 
for me to map right now.


From there, it doesn't make its way out to disk until the internal CLOG 
buffers are filled, at which point the least recently used buffer there is 
evicted to permanent storage.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-16 Thread Simon Riggs


On Fri, 2008-05-16 at 14:05 -0400, Greg Smith wrote:
 After reading the code and that documentation a bit, the part I'm
 still not sure about is whether the CLOG entry is created when the XID
 is assigned and then kept current as the state changes, or whether
 that isn't even in CLOG until the transaction is committed.  It seems
 like the latter, but there's some ambiguity in the wording and too
 many code paths for me to map right now.

Alvaro already said this, I thought? The clog is updated only at sub or
main transaction end, thank goodness. When the transactionid is assigned
the page of the clog that contains that transactionid is checked to see
if it already exists and if not, it is initialised.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-16 Thread Alvaro Herrera

Greg Smith wrote:

 After reading the code and that documentation a bit, the part I'm still  
 not sure about is whether the CLOG entry is created when the XID is  
 assigned and then kept current as the state changes, or whether that 
 isn't even in CLOG until the transaction is committed.  It seems like the 
 latter, but there's some ambiguity in the wording and too many code paths 
 for me to map right now.

pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
zeroed, which is the bit pattern for transaction in progress.  So when
a transaction starts, it only needs to ensure that the pg_clog page that
corresponds to it is allocated, but it need not write anything to it.

(*) Each transaction needs 2 bits, so on a 8 kB page there is space for
4 transactions/byte * 8 pages * 1kB/page = 32k transactions.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-16 Thread Alvaro Herrera

Alvaro Herrera wrote:

 pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
 zeroed, which is the bit pattern for transaction in progress.  So when
 a transaction starts, it only needs to ensure that the pg_clog page that
 corresponds to it is allocated, but it need not write anything to it.

Of course, in 8.3 it's not when the transaction starts, but when the Xid
is assigned (i.e. when the transaction first calls a read-write
command).  In previous versions it happens when the first snapshot is
taken (i.e. normally on the first command of any type, with very few
exceptions.)

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-16 Thread Tom Lane

Alvaro Herrera [EMAIL PROTECTED] writes:
 Greg Smith wrote:
 After reading the code and that documentation a bit, the part I'm still  
 not sure about is whether the CLOG entry is created when the XID is  
 assigned and then kept current as the state changes, or whether that 
 isn't even in CLOG until the transaction is committed.

 pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
 zeroed, which is the bit pattern for transaction in progress.  So when
 a transaction starts, it only needs to ensure that the pg_clog page that
 corresponds to it is allocated, but it need not write anything to it.

One additional point: this means that one transaction in every 32K
writing transactions *does* have to do extra work when it assigns itself
an XID, namely create and zero out the next page of pg_clog.  And that
doesn't just slow down the transaction in question, but the next few
guys that would like an XID but arrive on the scene while the
zeroing-out is still in progress.

This probably contributes to the behavior that Simon and Josh regularly
complain about, that our transaction execution time is subject to
unpredictable spikes.  I'm not sure how to get rid of it though.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tino Wildenhain


Hi,

Luke Lonergan wrote:
BTW – we’ve removed HINT bit checking in Greenplum DB and improved the 
visibility caching which was enough to provide performance at the same 
level as with the HINT bit optimization, but avoids this whole “write 
the data, write it to the log also, then write it again just for good 
measure” behavior.


can you go a bit deeper into how you implemented this or is it some IP
of greenplum you cannot reveal?

Btw, is there something with your eyes:
FONT SIZE=4FONT FACE=Verdana, Helvetica, ArialSPAN 
STYLE='font-size:14pt' ? :-))


Cheers
Tino



smime.p7s
Description: S/MIME Cryptographic Signature

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane

Jan de Visser [EMAIL PROTECTED] writes:
 Obviously, this issue is tied to the slow count(*) one, as I found out
 the hard way. Consider the following scenario:
 * Insert row
 * Update that row a couple of times
 * Rinse and repeat many times

 Now somewhere during that cycle, do a select count(*) just to see
 where you are. You will be appalled by how slow that is, due to not
 only the usual 'slow count(*)' reasons. This whole hint bit business
 makes it even worse, as demonstrated by the fact that running a vacuum
 before the count(*) makes the latter noticably faster.

Uh, well, you can't blame that entirely on hint-bit updates.  The vacuum
has simply *removed* two-thirds of the rows in the system, resulting in
a large drop in the number of rows that the select even has to look at.

It's certainly true that hint-bit updates cost something, but
quantifying how much isn't easy.  The off-the-cuff answer is to do the
select count(*) twice and see how much cheaper the second one is.  But
there are two big holes in that answer: the first is the possible cache
effects from having already read in the pages, and the second is that
the follow-up scan gets to avoid the visits to pg_clog that the first
scan had to make (which after all is the point of the hint bits).

I don't know any easy way to disambiguate the three effects that are at
work here.  But blaming it all on the costs of writing out hint-bit
updates is wrong.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Matthew Wakeling


On Thu, 15 May 2008, Luke Lonergan wrote:

BTW  we¹ve removed HINT bit checking in Greenplum DB and improved the
visibility caching which was enough to provide performance at the same level
as with the HINT bit optimization, but avoids this whole ³write the data,
write it to the log also, then write it again just for good measure²
behavior.


This sounds like a good option. I believe I suggested this a few months 
ago, however it was rejected because in the worst case (when the hints are 
not cached), if you're doing an index scan, you can do twice the number of 
seeks as before.


http://archives.postgresql.org/pgsql-performance/2007-12/msg00217.php

The hint data will be four bits per tuple plus overheads, so it could be 
made very compact, and therefore likely to stay in the cache fairly well. 
Each tuple fetched would have to be spaced really far apart in the 
database table in order to exhibit the worst case, because fetching a page 
of hint cache will cause 64kB or so of disc to appear in the disc's 
read-ahead buffer, which will be equivalent to 128MB worth of database 
table (assuming eight tuples per block and no overhead). As soon as you 
access another tuple in the same 128MB bracket, you'll hit the disc 
read-ahead buffer for the hints.


On balance, to me it still seems like a good option.

Matthew

--
Those who do not understand Unix are condemned to reinvent it, poorly.
   -- Henry Spencer
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Jan de Visser

On Thursday 15 May 2008 03:02:19 Tom Lane wrote:
 Jan de Visser [EMAIL PROTECTED] writes:
  Obviously, this issue is tied to the slow count(*) one, as I found out
  the hard way. Consider the following scenario:
  * Insert row
  * Update that row a couple of times
  * Rinse and repeat many times
 
  Now somewhere during that cycle, do a select count(*) just to see
  where you are. You will be appalled by how slow that is, due to not
  only the usual 'slow count(*)' reasons. This whole hint bit business
  makes it even worse, as demonstrated by the fact that running a vacuum
  before the count(*) makes the latter noticably faster.

 Uh, well, you can't blame that entirely on hint-bit updates.  The vacuum
 has simply *removed* two-thirds of the rows in the system, resulting in
 a large drop in the number of rows that the select even has to look at.

 It's certainly true that hint-bit updates cost something, but
 quantifying how much isn't easy.  The off-the-cuff answer is to do the
 select count(*) twice and see how much cheaper the second one is.  But
 there are two big holes in that answer: the first is the possible cache
 effects from having already read in the pages, and the second is that
 the follow-up scan gets to avoid the visits to pg_clog that the first
 scan had to make (which after all is the point of the hint bits).

 I don't know any easy way to disambiguate the three effects that are at
 work here.  But blaming it all on the costs of writing out hint-bit
 updates is wrong.

   regards, tom lane

True. But it still contributes to the fact that queries sometimes behave in a 
non-deterministic way, which IMHO is the major annoyance when starting to 
work with pgsql. And contrary to other causes (vacuum, checkpoints) this is 
woefully underdocumented.

jan

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Matthew Wakeling


On Thu, 15 May 2008, Heikki Linnakangas wrote:
 Is it really safe to update the hint bits in place? If there is a 
 power cut in the middle of writing a block, is there a guarantee from 
 the disc that the block will never be garbled?


Don't know, to be honest. We've never seen any reports of corrupted data 
that would suggest such a problem, but it doesn't seem impossible to me 
that some exotic storage system might do that.


Hmm. That problem is what WAL full-page-writes is meant to handle, isn't 
it? So basically, if you're telling people that WAL full-page-writes is 
safer than partial WAL, because it avoids updating pages in-place, then 
you shouldn't be updating pages in-place for the hint bits either. You 
can't win!


In fact, if the tuple's creating transaction has aborted, then the tuple 
can be vacuumed right there and then before it is even written. 


Not if you have any indexes on the table. To vacuum, you'll have to scan all 
indexes to remove pointers to the tuple.


Ah. Well, would that be so expensive? After all, someone has to do it 
eventually, and these are index entries that have only just been added 
anyway.


I can understand index updating being a bit messy in the middle of a 
checkpoint though, as you would have to write the update to the WAL, which 
you are checkpointing...


So, I don't know exactly how the WAL updates to indexes work, but my guess 
is that it has been implemented as write the blocks that we would change 
to the WAL. The problem with this is that all the changes to the index 
are done individually, so there's no easy way to undo one of them later 
on when you find out that the transaction has been aborted during the 
checkpoint.


An alternative would be to build a list of changes in the WAL without 
actually changing the underlying index at all. When reading the index, you 
would read the list first (which would be in memory, and in an 
efficient-to-search structure), then read the original index and add the 
two. Then when checkpointing, vet all the changes against known aborted 
transactions before making all the changes to the index together. This is 
likely to speed up index writes quite a bit, and also allow you to 
effectively vacuum aborted tuples before they get written to the disc.


Matthew

--
Vacuums are nothings. We only mention them to let them know we know
they're there.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Ron Mayer


Matthew Wakeling wrote:

On Thu, 15 May 2008, Luke Lonergan wrote:

...HINT bit optimization, but avoids this whole ³write the data,
write it to the log also, then write it again just for good measure²

...
The hint data will be four bits per tuple plus overheads, so it could be 
made very compact, and therefore likely to stay in the cache fairly 
well. 


Does it seem like these HINT bits would be good candidates to move
off to map forks similar to how the visibility map stuff will be handled?

Since (if I understand right) only the hint bits change during the
select(*) it seems a lot less write-IO would happen if such a map
were updated rather than the data pages themselves.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Alvaro Herrera

Greg Smith escribió:
 On Thu, 15 May 2008, Pavan Deolasee wrote:

 I had suggested in the past that whenever we set hint bits for a tuple, 
 we should check all other tuples in the page and set their hint bits 
 too to avoid multiple writes of the same page. I guess the idea got 
 rejected because of lack of benchmarks to prove the benefit.

 From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the  
 impression the idea was to have the background writer get involved to 
 help with this particular situation.

The problem is that the bgwriter does not understand about the content
of the pages it is writing -- they're opaque pages for all it knows.  So
it cannot touch the hint bits.

I agree with Pavan that it's likely that setting hint bits in batches
instead of just for the tuple being examined is a benefit.  However,
it's perhaps not so good to be doing it in a foreground process, because
you're imposing extra cost to the client queries which we want to be as
fast as possible.  Perhaps the thing to do is have a database-local
bgwriter which would scan pages and do this kind of change ...
a different kind of process to be launched by autovacuum perhaps.

If we had the bitmask in a separate map fork, this could be cheap.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane

Matthew Wakeling [EMAIL PROTECTED] writes:
 Hmm. That problem is what WAL full-page-writes is meant to handle, isn't 
 it? So basically, if you're telling people that WAL full-page-writes is 
 safer than partial WAL, because it avoids updating pages in-place, then 
 you shouldn't be updating pages in-place for the hint bits either. You 
 can't win!

This argument ignores the nature of the data change.  With a hint-bit
update, no data is being shuffled around, so there is no danger from a
partial page write.

A disk that leaves an individual sector corrupt would be a problem,
but I don't think that's a huge risk.  Keep in mind that disks aren't
designed to just stop dead when power dies --- they are made to be able
to park their heads before the juice is entirely gone.  I think it's
reasonable to assume they'll finish writing the sector in progress
before they start parking.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Heikki Linnakangas


Matthew Wakeling wrote:

On Thu, 15 May 2008, Heikki Linnakangas wrote:
 Is it really safe to update the hint bits in place? If there is a  
power cut in the middle of writing a block, is there a guarantee from 
 the disc that the block will never be garbled?


Don't know, to be honest. We've never seen any reports of corrupted 
data that would suggest such a problem, but it doesn't seem impossible 
to me that some exotic storage system might do that.


Hmm. That problem is what WAL full-page-writes is meant to handle, isn't 
it? So basically, if you're telling people that WAL full-page-writes is 
safer than partial WAL, because it avoids updating pages in-place, then 
you shouldn't be updating pages in-place for the hint bits either. You 
can't win!


Full-page-writes protect from torn pages, that is, when one half of an 
update hits the disk but the other one doesn't. In particular, if the 
beginning of the page where the WAL pointer (XLogRecPtr) is flushed to 
disk, but the actual changes elsewhere in the page aren't, you're in 
trouble. WAL replay will look at the WAL pointer, and think that the 
page doesn't need to be replayed, while other half of the update is 
still missing.


Hint bits are different. We're only updating a single bit, and it 
doesn't matter from correctness point of view whether the hint bit 
update hits the disk or not. But what would spell trouble is if the disk 
controller/whatever garbles the whole sector, IOW changes something else 
than the changed bit, while doing the update.


In fact, if the tuple's creating transaction has aborted, then the 
tuple can be vacuumed right there and then before it is even written. 


Not if you have any indexes on the table. To vacuum, you'll have to 
scan all indexes to remove pointers to the tuple.


Ah. Well, would that be so expensive? After all, someone has to do it 
eventually, and these are index entries that have only just been added 
anyway.


Scanning all indexes? Depends on your table of course, but yes it would 
be expensive in general.


An alternative would be to build a list of changes in the WAL without 
actually changing the underlying index at all. When reading the index, 
you would read the list first (which would be in memory, and in an 
efficient-to-search structure), then read the original index and add the 
two. Then when checkpointing, vet all the changes against known aborted 
transactions before making all the changes to the index together. This 
is likely to speed up index writes quite a bit, and also allow you to 
effectively vacuum aborted tuples before they get written to the disc.


There's not much point optimizing something that only helps with aborted 
transactions.


The general problem with any idea that involves keeping a list of 
changes made in a transaction is that that list will grow big during 
bulk loads, so you'll have to overflow to disk or abandon the list 
approach. Which means that it won't help with bulk loads.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Heikki Linnakangas


Alvaro Herrera wrote:

Greg Smith escribió:

On Thu, 15 May 2008, Pavan Deolasee wrote:

I had suggested in the past that whenever we set hint bits for a tuple, 
we should check all other tuples in the page and set their hint bits 
too to avoid multiple writes of the same page. I guess the idea got 
rejected because of lack of benchmarks to prove the benefit.


From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the  
impression the idea was to have the background writer get involved to 
help with this particular situation.


The problem is that the bgwriter does not understand about the content
of the pages it is writing -- they're opaque pages for all it knows.  So
it cannot touch the hint bits.


We know what kind of a relation we're dealing with in ReadBuffer, so we 
could add a flag to BufferDesc to mark heap pages.



If we had the bitmask in a separate map fork, this could be cheap.


I don't buy that. The point of a hint bit is that it's right there along 
with the tuple you're looking at. If you have to look at a separate 
buffer, you might as well just look at clog.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Matthew Wakeling


On Thu, 15 May 2008, Heikki Linnakangas wrote:
There's not much point optimizing something that only helps with aborted 
transactions.


That's fair enough, but this list method is likely to speed up index 
writes anyway.


The general problem with any idea that involves keeping a list of changes 
made in a transaction is that that list will grow big during bulk loads, so 
you'll have to overflow to disk or abandon the list approach. Which means 
that it won't help with bulk loads.


Yeah, it wouldn't be a list of changes for the transaction, it would be a 
list of changes since the last checkpoint. Keeping data in memory for the 
length of the transaction is doomed to failure, because there is no bound 
on its size, so bulk loads are still going to miss out on hint 
optimisation.


Matthew

--
for a in past present future; do
 for b in clients employers associates relatives neighbours pets; do
 echo The opinions here in no way reflect the opinions of my $a $b.
done; done

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Alvaro Herrera

Heikki Linnakangas escribió:
 Alvaro Herrera wrote:

 The problem is that the bgwriter does not understand about the content
 of the pages it is writing -- they're opaque pages for all it knows.  So
 it cannot touch the hint bits.

 We know what kind of a relation we're dealing with in ReadBuffer, so we  
 could add a flag to BufferDesc to mark heap pages.

Hmm, I was thinking that it would need access to the catalogs to know
where the tuples are, but that's certainly not true, so perhaps it could
be made to work.

 If we had the bitmask in a separate map fork, this could be cheap.

 I don't buy that. The point of a hint bit is that it's right there along  
 with the tuple you're looking at. If you have to look at a separate  
 buffer, you might as well just look at clog.

True -- I was confusing this with the idea of having the tuple MVCC
header (xmin, xmax, etc) in a separate fork, which would make the idea
of index-only scans more feasible at the expense of seqscans.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Alvaro Herrera

Matthew Wakeling wrote:

Aside from the rest of commentary, a slight clarification:

 So, as I understand it, Postgres works like this:

 1. You begin a transaction. Postgres writes an entry into pg_clog.

Starting a transaction does not write anything to pg_clog.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane

Alvaro Herrera [EMAIL PROTECTED] writes:
 Heikki Linnakangas escribió:
 We know what kind of a relation we're dealing with in ReadBuffer, so we  
 could add a flag to BufferDesc to mark heap pages.

 Hmm, I was thinking that it would need access to the catalogs to know
 where the tuples are, but that's certainly not true, so perhaps it could
 be made to work.

The issue in my mind is not so much could bgwriter physically do it
as that it's a violation of module layering.  That has real
consequences, like potential for deadlocks.  It'll become particularly
pressing if we go forward with the plans to get rid of the separate
dedicated buffers for pg_clog etc and have them work in the main
shared-buffer pool.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Robert Lor


Tom Lane wrote:

It's certainly true that hint-bit updates cost something, but
quantifying how much isn't easy. 
Maybe we can instrument the code with DTrace probes to quantify the 
actual costs.  I'm not familiar with the code, but if I know where to 
place the probes, I can easily do a quick test and provide the data.

 The off-the-cuff answer is to do the
select count(*) twice and see how much cheaper the second one is. 
Doesn't seem the second run is cheaper as shown in the results below. 
The data came from the probes I've added recently.


*** Run #1 **
SQL Statement  : select count(*) from accounts;
Execution time : 1086.58 (ms)

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384   1247  1
 1663  16384   2600  1
 1663  16384   2703  1
 1663  16384   1255  2
 1663  16384   2650  2
 1663  16384   2690  3
 1663  16384   2691  3
 1663  16384  16397   8390

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16397   2865

Total buffer cache hits  :  1932
Total buffer cache misses:  6471
Average read time from cache :  5638 (ns)
Average read time from disk  :143371 (ns)
Average write time to disk   : 20368 (ns)


*** Run #2 **
SQL Statement  : select count(*) from accounts;
Execution time : 1115.94 (ms)

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384  16397   8390

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16397   2865

Total buffer cache hits  :  1931
Total buffer cache misses:  6459
Average read time from cache :  4357 (ns)
Average read time from disk  :154127 (ns)
Average write time to disk   : 20368 (ns)


-Robert

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Tom Lane

Robert Lor [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 It's certainly true that hint-bit updates cost something, but
 quantifying how much isn't easy. 

 Maybe we can instrument the code with DTrace probes to quantify the 
 actual costs.

Hmm, the problem would be trying to figure out what percentage of writes
could be blamed solely on hint-bit updates and not any other change to
the page.  I don't think that the bufmgr currently keeps enough state to
know that, but you could probably modify it easily enough, since callers
distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave.  Define
another flag bit that's set only by the first, and test it during
write-out.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Robert Lor


Tom Lane wrote:

Hmm, the problem would be trying to figure out what percentage of writes
could be blamed solely on hint-bit updates and not any other change to
the page.  I don't think that the bufmgr currently keeps enough state to
know that, but you could probably modify it easily enough, since callers
distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave.  Define
another flag bit that's set only by the first, and test it during
write-out.
  
Ok, I made a few changes to bufmgr per my understanding of your 
description above and with my limited understanding of the code. Patch 
is attached.


Assuming the patch is correct, the effect of writes due to hint bits is 
quite significant. I collected the data below by runing pgbench in one 
terminal and psql on another to run the query.


Is the data plausible?

-Robert

--


Backend PID: 16189
SQL Statement  : select count(*) from accounts;
Execution time : 17.33 sec

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384   2600  1
 1663  16384   2601  1
 1663  16384   2615  1
 1663  16384   1255  2
 1663  16384   2602  2
 1663  16384   2603  2
 1663  16384   2616  2
 1663  16384   2650  2
 1663  16384   2678  2
 1663  16384   1247  3
 1663  16384   1249  3
 1663  16384   2610  3
 1663  16384   2655  3
 1663  16384   2679  3
 1663  16384   2684  3
 1663  16384   2687  3
 1663  16384   2690  3
 1663  16384   2691  3
 1663  16384   2703  4
 1663  16384   1259  5
 1663  16384   2653  5
 1663  16384   2662  5
 1663  16384   2663  5
 1663  16384   2659  7
 1663  16384  16397   8390

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16402  2
 1663  16384  16394 11
 1663  16384  16397   4771

== Hint Bits Write Counts ==
Tablespace   Database  Table  Count
 1663  16384  16397   4508

Total buffer cache hits  :   732
Total buffer cache misses:  7731
Average read time from cache :  9136 (ns)
Average read time from disk  :384201 (ns)
Average write time to disk   :210709 (ns)


Backend PID: 16189
SQL Statement  : select count(*) from accounts;
Execution time : 12.72 sec

 Buffer Read Counts 
Tablespace   Database  Table  Count
 1663  16384  16397   8392

 Dirty Buffer Write Counts =
Tablespace   Database  Table  Count
 1663  16384  16394  6
 1663  16384  16402  7
 1663  16384  16397   2870

== Hint Bits Write Counts ==
Tablespace   Database  Table  Count
 1663  16384  16402  2
 1663  16384  16397   2010

Total buffer cache hits  :   606
Total buffer cache misses:  7786
Average read time from cache :  6949 (ns)
Average read time from disk  :706288 (ns)
Average write time to disk   : 90426 (ns)

Index: bufmgr.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.228
diff -u -3 -p -r1.228 bufmgr.c
--- bufmgr.c	1 Jan 2008 19:45:51 -	1.228
+++ bufmgr.c	15 May 2008 20:56:38 -
@@ -42,6 +42,7 @@
 #include storage/smgr.h
 #include utils/resowner.h
 #include pgstat.h
+#include pg_trace.h
 
 
 /* Note: these two macros only work on shared buffers, not local ones! */
@@ -171,6 +172,7 @@ ReadBuffer_common(Relation reln, BlockNu
 	if (isExtend)
 		blockNum = smgrnblocks(reln-rd_smgr);
 
+	TRACE_POSTGRESQL_BUFFER_READ_START(blockNum, reln-rd_node.spcNode, reln-rd_node.dbNode, reln-rd_node.relNode, isLocalBuf);
 	pgstat_count_buffer_read(reln);
 
 	if (isLocalBuf)
@@ -200,12 +202,16 @@ ReadBuffer_common(Relation reln, BlockNu
 	{
 		if (!isExtend)
 		{
+			TRACE_POSTGRESQL_BUFFER_HIT();
 			/* Just need to update stats before we exit */
 			pgstat_count_buffer_hit(reln);
 
 			if (VacuumCostActive)
 VacuumCostBalance += VacuumCostPageHit;
 
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(blockNum,
+ reln-rd_node.spcNode, reln-rd_node.dbNode,
+ reln-rd_node.relNode, isLocalBuf, found);
 			return BufferDescriptorGetBuffer(bufHdr);
 		}
 
@@ -257,6 +263,7 @@ ReadBuffer_common(Relation reln, BlockNu
 			} while (!StartBufferIO(bufHdr, true));
 		}
 	}
+	TRACE_POSTGRESQL_BUFFER_MISS();
 
 	/*
 	 * if we have gotten to this

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread James Mansion


Alvaro Herrera wrote:

Hint bits are used to mark tuples as created and/or deleted by
transactions that are know committed or aborted.  To determine the
visibility of a tuple without such bits set, you need to consult pg_clog
and possibly pg_subtrans, so it is an expensive check.  On the other
  
So, how come there is this outstanding work to do, which will inevitably 
be done, and it
hasn't been done until it is 'just too late' to avoid getting in the way 
of the query?


The OP didn't suggest that he had just loaded the data.

Also - is it the case that this only affects the case where updated 
pages were spilled
during the transaction that changed them?  ie, if we commit a 
transaction and there
are changed rows still in the cache since their pages are not evicted 
yet, are the hint
bits set immediately so that page is written just once?  Seems this 
would be common

in most OLTP systems.

Heikki points out that the list might get big and need to be abandoned, 
but then you
fall back to scheduling a clog scan that can apply the bits, which does 
what you have
now, though hopefully in a way that fills slack disk IO rather than 
waiting for the

read.

Matthew says: 'it would be a list of changes since the last checkpoint' 
but I don't
see why you can't start writing hints to in-memory pages as soon as the 
transaction

ends.  You might fall behind, but I doubt it with modern CPU speeds.

I can't see why Pavan's suggestion to try to update as many of the bits 
as possible

when a dirty page is evicted would be contentious.

I do think this is something of interest to users, not just developers, 
since it

may influence the way updates are processed where it is reasonable to do
so in 'bite sized chunks' as a multipart workflow.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Gregory Stark

Luke Lonergan [EMAIL PROTECTED] writes:

 BTW  we¹ve removed HINT bit checking in Greenplum DB and improved the
 visibility caching which was enough to provide performance at the same level
 as with the HINT bit optimization, but avoids this whole ³write the data,
 write it to the log also, then write it again just for good measure²
 behavior.

 For people doing data warehousing work like the poster, this Postgres
 behavior is miserable.  It should be fixed for 8.4 for sure (volunteers?)

For people doing data warehousing I would think the trick would be to do
something like what we do to avoid WAL logging for tables created in the same
transaction. 

That is, if you're loading a lot of data at the same time then all of that
data is going to be aborted or committed and that will happen at the same
time. Ideally we would find a way to insert the data with the hint bits
already set to committed and mark the section of the table as being only
provisionally extended so other transactions wouldn't even look at those pages
until the transaction commits.

This is similar to the abortive attempt to have the abovementioned WAL logging
trick insert the records pre-frozen. I recall there were problems with that
idea though but I don't recall if they were insurmountable or just required
more work.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-15 Thread Kevin Grittner

 On Thu, May 15, 2008 at  5:11 PM, in message
[EMAIL PROTECTED], James Mansion
[EMAIL PROTECTED] wrote: 
 Alvaro Herrera wrote:
 Hint bits are used to mark tuples as created and/or deleted by
 transactions that are know committed or aborted.  To determine the
 visibility of a tuple without such bits set, you need to consult
pg_clog
 and possibly pg_subtrans, so it is an expensive check.  On the
other
   
 So, how come there is this outstanding work to do, which will
inevitably 
 be done, and it
 hasn't been done until it is 'just too late' to avoid getting in the
way 
 of the query?
 
There has been discussion from time to time about setting the hint
bits for tuple inserts which occur within the same database
transaction as the creation of the table into which they're being
inserted.  That would allow people to cover many of the bulk load
situations.  I don't see it on the task list.  (I would also argue
that there is little information lost, even from a forensic
perspective, to writing such rows as frozen.)  Is this idea done,
dead, or is someone working on it?
 
If we could set hint bits on dirty buffer pages after the commit, we'd
cover the OLTP situation.  In many situations, there is a bigger OS
cache than PostgreSQL shared memory, and an attempt to set the bits
soon after the commit would coalesce the two writes into one physical
write using RAM-based access, which would be almost as good.  I don't
know if it's feasible to try to do that after the pages have moved
from the PostgreSQL cache to the OS cache, but it would likely be a
performance win.
 
If we are going to burden any requester process with the job of
setting the hint bits, it would typically be better to burden the one
doing the data modification rather than some random thread later
trying to read data from the table.  Of course, getting work off the
requester processes onto some background worker process is generally
even better.
 
-Kevin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Merlin Moncure

On Wed, May 14, 2008 at 4:09 PM, Doug Eck [EMAIL PROTECTED] wrote:
 I have a large table (~ 2B rows) that contains an indexed timestamp column.
 I am attempting to run a query to determine the number of rows for a given
 day using something like select count(*) from tbl1 where ts between
 '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'.  Explain tells me
 that the query will be done using an index scan (as I would expect), and I
 realize that it is going to take a while.  My question concerns some unusual
 I/O activity on the box (SUSE)  when I run the query.

 For the first couple of minutes I see reads only.  After that vmstat shows
 mixed reads and writes in a ratio of about 1 block read to 5 blocks
 written.  We have determined that files in our data and log partitions are
 being hit, but the file system itself is not growing during this time (it
 appears to be writing over the same chunk of space over and over again).
 Memory on the box is not being swapped while all of this is happening.  I
 would have guessed that a select count(*) would not require a bunch of
 writes, and I can't begin to figure out why the number of blocks written are
 so much higher than the blocks read.  If I modify the where clause to only
 count the rows for a given minute or two, I see the reads but I never see
 the unusual write behavior.

 Any thoughts into what could be going on?  Thanks in advance for your help.

can you post the exact output of explain analyze? (or, at least,
explain if the query takes too long)

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Doug Eck

- Original Message 
From: Merlin Moncure [EMAIL PROTECTED]
To: Doug Eck [EMAIL PROTECTED]
Cc: pgsql-performance@postgresql.org
Sent: Wednesday, May 14, 2008 3:38:23 PM
Subject: Re: [PERFORM] I/O on select count(*)

On Wed, May 14, 2008 at 4:09 PM, Doug Eck [EMAIL PROTECTED] wrote:
 I have a large table (~ 2B rows) that contains an indexed timestamp column.
 I am attempting to run a query to determine the number of rows for a given
 day using something like select count(*) from tbl1 where ts between
 '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'.  Explain tells me
 that the query will be done using an index scan (as I would expect), and I
 realize that it is going to take a while.  My question concerns some unusual
 I/O activity on the box (SUSE)  when I run the query.

 For the first couple of minutes I see reads only.  After that vmstat shows
 mixed reads and writes in a ratio of about 1 block read to 5 blocks
 written.  We have determined that files in our data and log partitions are
 being hit, but the file system itself is not growing during this time (it
 appears to be writing over the same chunk of space over and over again).
 Memory on the box is not being swapped while all of this is happening.  I
 would have guessed that a select count(*) would not require a bunch of
 writes, and I can't begin to figure out why the number of blocks written are
 so much higher than the blocks read.  If I modify the where clause to only
 count the rows for a given minute or two, I see the reads but I never see
 the unusual write behavior.

 Any thoughts into what could be going on?  Thanks in advance for your help.

can you post the exact output of explain analyze? (or, at least,
explain if the query takes too long)

merlin

The query takes a long time to run, so I'll start with the explain output.  I
can run explain analyze (given enough time) if you believe its output
could hold some clues.

db_2008= explain select count(*) from ot_2008_05 where
transact_time between '2008-05-12 00:00:00.000' and '2008-05-12
23:59:59.999';

QUERY PLAN
---
 Aggregate  (cost=10368613.47..10368613.48 rows=1 width=0)
   -  Index Scan using ot_2008_05_ak2 on ot_2008_05  (cost=0.00..10011333.27 
rows=142912078 width=0)
 Index Cond: ((transact_time = '2008-05-12
00:00:00-04'::timestamp with time zone) AND (transact_time =
'2008-05-12 23:59:59.999-04'::timestamp with time zone))
(3 rows)

db_2008=

Doug

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Kevin Grittner

 Doug Eck [EMAIL PROTECTED] wrote: 
 
 I am attempting to run a query to determine the number of rows for a
given 
 day using something like select count(*) from tbl1 where ts between

 '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'.  Explain
tells me 
 that the query will be done using an index scan (as I would expect),
and I 
 realize that it is going to take a while.  My question concerns some
unusual 
 I/O activity on the box (SUSE)  when I run the query.
 
 For the first couple of minutes I see reads only.  After that vmstat
shows 
 mixed reads and writes in a ratio of about 1 block read to 5 blocks
written.  
 
 Any thoughts into what could be going on?  Thanks in advance for your
help.
 
Odd as it may seem, a SELECT can cause a page to be rewritten.
 
If this is the first time that the rows are being read since they were
inserted (or since the database was loaded, including from backup), it
may be rewriting the rows to set hint bits, which can make subsequent
access faster.
 
The best solution may be to vacuum more often.
 
http://archives.postgresql.org/pgsql-performance/2007-12/msg00206.php
 
-Kevin
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Greg Smith


On Wed, 14 May 2008, Kevin Grittner wrote:


If this is the first time that the rows are being read since they were
inserted (or since the database was loaded, including from backup), it
may be rewriting the rows to set hint bits, which can make subsequent
access faster.


This is the second time this has come up recently, and I know it used to 
puzzle me too.  This is a particularly relevant area to document better 
for people doing benchmarking.  As close I've found to a useful commentary 
on this subject is the thread at 
http://archives.postgresql.org/pgsql-patches/2005-07/msg00390.php


I still don't completely understand this myself though, if I did I'd add a 
FAQ on it.  Anyone want to lecture for a minute on the birth and care of 
hint bits?  I'll make sure any comments here get onto the wiki.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Alvaro Herrera

Greg Smith wrote:
 On Wed, 14 May 2008, Kevin Grittner wrote:

 If this is the first time that the rows are being read since they were
 inserted (or since the database was loaded, including from backup), it
 may be rewriting the rows to set hint bits, which can make subsequent
 access faster.

 This is the second time this has come up recently, and I know it used to  
 puzzle me too.  This is a particularly relevant area to document better  
 for people doing benchmarking.  As close I've found to a useful 
 commentary on this subject is the thread at  
 http://archives.postgresql.org/pgsql-patches/2005-07/msg00390.php

 I still don't completely understand this myself though, if I did I'd add 
 a FAQ on it.  Anyone want to lecture for a minute on the birth and care 
 of hint bits?  I'll make sure any comments here get onto the wiki.

Hint bits are used to mark tuples as created and/or deleted by
transactions that are know committed or aborted.  To determine the
visibility of a tuple without such bits set, you need to consult pg_clog
and possibly pg_subtrans, so it is an expensive check.  On the other
hand, if the tuple has the bits set, then it's state is known (or, at
worst, it can be calculated easily from your current snapshot, without
looking at pg_clog.)

There are four hint bits:
XMIN_COMMITTED -- creating transaction is known committed
XMIN_ABORTED   -- creating transaction is known aborted
XMAX_COMMITTED -- same, for the deleting transaction
XMAX_ABORTED   -- ditto

If neither of the bits is set, then the transaction is either in
progress (which you can check by examining the list of running
transactions in shared memory) or your process is the first one to check
(in which case, you need to consult pg_clog to know the status, and you
can update the hint bits if you find out a permanent state).


Regarding FAQs, I'm having trouble imagining putting this in the user
FAQ; I think it belongs into the developer's FAQ.  However, a
benchmarker is not going to look there.  Maybe we should start a
benchmarker's FAQ?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Greg Smith


On Wed, 14 May 2008, Alvaro Herrera wrote:

If neither of the bits is set, then the transaction is either in 
progress (which you can check by examining the list of running 
transactions in shared memory) or your process is the first one to check 
(in which case, you need to consult pg_clog to know the status, and you 
can update the hint bits if you find out a permanent state).


So is vacuum helpful here because it will force all that to happen in one 
batch?  To put that another way:  if I've run a manual vacuum, is it true 
that it will have updated all the hint bits to XMIN_COMMITTED for all the 
tuples that were all done when the vacuum started?



Regarding FAQs, I'm having trouble imagining putting this in the user
FAQ; I think it belongs into the developer's FAQ.  However, a
benchmarker is not going to look there.  Maybe we should start a
benchmarker's FAQ?


On the wiki I've started adding a series of things that are 
performance-related FAQs.  There's three of them mixed in the bottom of 
http://wiki.postgresql.org/wiki/Frequently_Asked_Questions right now, 
about slow count(*) and dealing with slow queries.


Here the FAQ would be Why am I seeing all these writes when I'm just 
doing selects on my table?, and if it's mixed in with a lot of other 
performance related notes people should be able to find it.  The answer 
and suggestions should be simple enough to be useful to a user who just 
noticed this behavior, while perhaps going into developer land for those 
who want to know more about the internals.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Jan de Visser

On 5/14/08, Greg Smith [EMAIL PROTECTED] wrote:
 On Wed, 14 May 2008, Alvaro Herrera wrote:


  If neither of the bits is set, then the transaction is either in progress
 (which you can check by examining the list of running transactions in shared
 memory) or your process is the first one to check (in which case, you need
 to consult pg_clog to know the status, and you can update the hint bits if
 you find out a permanent state).
 

  So is vacuum helpful here because it will force all that to happen in one
 batch?  To put that another way:  if I've run a manual vacuum, is it true
 that it will have updated all the hint bits to XMIN_COMMITTED for all the
 tuples that were all done when the vacuum started?

From my benchmarking experience: Yes, vacuum helps. See also below.



  Regarding FAQs, I'm having trouble imagining putting this in the user
  FAQ; I think it belongs into the developer's FAQ.  However, a
  benchmarker is not going to look there.  Maybe we should start a
  benchmarker's FAQ?
 

  On the wiki I've started adding a series of things that are
 performance-related FAQs.  There's three of them mixed in the bottom of
 http://wiki.postgresql.org/wiki/Frequently_Asked_Questions
 right now, about slow count(*) and dealing with slow queries.

  Here the FAQ would be Why am I seeing all these writes when I'm just doing
 selects on my table?, and if it's mixed in with a lot of other performance
 related notes people should be able to find it.  The answer and suggestions
 should be simple enough to be useful to a user who just noticed this
 behavior, while perhaps going into developer land for those who want to know
 more about the internals.

Obviously, this issue is tied to the slow count(*) one, as I found out
the hard way. Consider the following scenario:
* Insert row
* Update that row a couple of times
* Rinse and repeat many times

Now somewhere during that cycle, do a select count(*) just to see
where you are. You will be appalled by how slow that is, due to not
only the usual 'slow count(*)' reasons. This whole hint bit business
makes it even worse, as demonstrated by the fact that running a vacuum
before the count(*) makes the latter noticably faster.

jan

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Pavan Deolasee

On Thu, May 15, 2008 at 7:51 AM, Greg Smith [EMAIL PROTECTED] wrote:


 So is vacuum helpful here because it will force all that to happen in one
 batch?  To put that another way:  if I've run a manual vacuum, is it true
 that it will have updated all the hint bits to XMIN_COMMITTED for all the
 tuples that were all done when the vacuum started?


Yes. For that matter, even a plain SELECT or count(*) on the entire
table is good enough. That will check every tuple for visibility and
set it's hint bits.

Another point to note is that the hint bits are checked and set on a
per tuple basis. So especially during index scan, the same heap page
may get rewritten many times. I had suggested in the past that
whenever we set hint bits for a tuple, we should check all other
tuples in the page and set their hint bits too to avoid multiple
writes of the same page. I guess the idea got rejected because of lack
of benchmarks to prove the benefit.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Luke Lonergan

BTW  we¹ve removed HINT bit checking in Greenplum DB and improved the
visibility caching which was enough to provide performance at the same level
as with the HINT bit optimization, but avoids this whole ³write the data,
write it to the log also, then write it again just for good measure²
behavior.

For people doing data warehousing work like the poster, this Postgres
behavior is miserable.  It should be fixed for 8.4 for sure (volunteers?)

BTW  for the poster¹s benefit, you should implement partitioning by date,
then load each partition and VACUUM ANALYZE after each load.  You probably
won¹t need the date index anymore  so your load times will vastly improve
(no indexes), you¹ll store less data (no indexes) and you¹ll be able to do
simpler data management with the partitions.

You may also want to partition AND index if you do a lot of short range
selective date predicates.  Example would be: partition by day, index on
date field, queries selective on date ranges by hour will then select out
only the day needed, then index scan to get the hourly values.  Typically
time-oriented data is nearly time sorted anyway, so you¹ll also get the
benefit of a clustered index.

- Luke


On 5/15/08 10:40 AM, Pavan Deolasee [EMAIL PROTECTED] wrote:

 On Thu, May 15, 2008 at 7:51 AM, Greg Smith [EMAIL PROTECTED] wrote:
 
 
  So is vacuum helpful here because it will force all that to happen in one
  batch?  To put that another way:  if I've run a manual vacuum, is it true
  that it will have updated all the hint bits to XMIN_COMMITTED for all the
  tuples that were all done when the vacuum started?
 
 
 Yes. For that matter, even a plain SELECT or count(*) on the entire
 table is good enough. That will check every tuple for visibility and
 set it's hint bits.
 
 Another point to note is that the hint bits are checked and set on a
 per tuple basis. So especially during index scan, the same heap page
 may get rewritten many times. I had suggested in the past that
 whenever we set hint bits for a tuple, we should check all other
 tuples in the page and set their hint bits too to avoid multiple
 writes of the same page. I guess the idea got rejected because of lack
 of benchmarks to prove the benefit.
 
 Thanks,
 Pavan
 
 --
 Pavan Deolasee
 EnterpriseDB http://www.enterprisedb.com
 
 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Greg Smith


On Thu, 15 May 2008, Pavan Deolasee wrote:

I had suggested in the past that whenever we set hint bits for a tuple, 
we should check all other tuples in the page and set their hint bits too 
to avoid multiple writes of the same page. I guess the idea got rejected 
because of lack of benchmarks to prove the benefit.


From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the 
impression the idea was to have the background writer get involved to help 
with this particular situation.  The way things are setup right now, I 
would guess it's impractical for an individual client to be forced to wait 
for all the tuples in a block to be checked just because it ran into one 
tuple that needed its hint bits refreshed.


If the pages that had any hint bit updates since they were read/created 
were made easy to identify (maybe they already are), the writer could do 
the kind of scan you suggest anytime it was about to evict that page. 
That wouldn't be in the client's critical path and it would maximize the 
possible improvement here.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] I/O on select count(*)

2008-05-14 Thread Joshua D. Drake

On Thu, 15 May 2008 10:52:01 +0800
Luke Lonergan [EMAIL PROTECTED] wrote:

 BTW  we¹ve removed HINT bit checking in Greenplum DB and improved the
 visibility caching which was enough to provide performance at the
 same level as with the HINT bit optimization, but avoids this whole
 ³write the data, write it to the log also, then write it again just
 for good measure² behavior.
 
 For people doing data warehousing work like the poster, this Postgres
 behavior is miserable.  It should be fixed for 8.4 for sure
 (volunteers?)

Donations? You have the code Luke :)

Sincerely,

Joshua D. Drake

P.S. Sorry for the almost bad Star Wars pun.

-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate




signature.asc
Description: PGP signature

45 matches

Mail list logo