Re: [PERFORM] lru_multiplier and backend page write-outs

2008-11-06 Thread Peter Schuller
Hello,

 At one point I envisioned making it smart enough to try and handle the 
 scenario you describe--on an idle system, you may very well want to write 
 out dirty and recently accessed buffers if there's nothing else going on. 
 But such behavior is counter-productive on a busy system, which is why a 
 similar mechanism that existed before 8.3 was removed.  Making that only 
 happen when idle requires a metric for what busy means, which is tricky 
 to do given the information available to this particular process.
 
 Short version:  if you never fill the buffer cache, buffers_clean will 
 always be zero, and you'll only see writes by checkpoints and things not 
 operating with the standard client buffer allocation mechanism.  Which 
 brings us to...

Sure. I am not really out to get the background writer to
pre-emptively do idle trickling. Though I can see cases where one
might care about this (such as lessening the impact of OS buffer cache
delays on checkpoints), it's not what I am after now.

  One theory: Is it the auto vacuum process? Stracing those I've seen
  that they very often do writes directly to disk.
 
 In order to keep it from using up the whole cache with maintenance
 overhead, vacuum allocates a 256K ring of buffers and re-uses ones
 from there whenever possible.  That will generate buffer_backend writes
 when that ring fills but it has more left to scan.  Your theory that all
 the backend writes are coming from vacuum seems consistent with what
 you've described.

The bit that is inconsistent with this theory, given the above ring
buffer description, is that I saw the backend write-out count
increasing constantly during the write activity I was generating to the
database. However (because in this particular case it was a small
database used for some latency related testing), no table was ever
large enough that 256k buffers would ever be filled by the process of
vacuuming a single table. Most tables would likely have been a handful
to a couple of hundred pages in size.

In addition, when I say "constantly" above I mean that the count
increases even between successive SELECT:s (of the stat table) with
only a second or two in between. In the absence of long-running
vacuums, that discounts vacuuming because the naptime is 1 minute.
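
(Concretely, the successive SELECT:s were nothing more elaborate than
repeatedly sampling the stats view, along the lines of

  SELECT buffers_clean, buffers_backend, buffers_alloc
  FROM pg_stat_bgwriter;

and watching buffers_backend climb between samples.)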

In fact this already discounted vacuuming even without the added
information you provided above, but I didn't realize that when
originally posting.

The reason I mentioned vacuuming was that the use case is such that we
do have a lot of tables constantly getting writes and updates, but
they are all small.

Is anything else known that might be generating the writes, if it is
not vacuuming?

 You might even want to drop the two background writer parameters you've 
 tweaked upwards back down closer to their original values.  I get the 
 impression you might have increased those hoping for more background 
 writer work because you weren't seeing any.  If you ever do get to where 
 your buffer cache is full and the background writer starts doing 
 something, those could jump from ineffective to wastefully heavy at that 
 point.

I tweaked it in order to eliminate backends having to do synchronous
(with respect to the operating system, even if not with respect to the
underlying device) writes.

The idea is that writes to the operating system are less
understood/controlled, in terms of any latency they may cause. It would
be very nice if the backend writes were always zero under normal
circumstances (or at least grew only very rarely, in edge cases where
the JIT policy did not succeed), in order to make it a more relevant
and rare observation that the backend write-outs are systematically
increasing.

On this topic btw, was it considered to allow the administrator to
specify a fixed-size margin to use when applying the JIT policy? (The
JIT policy and logic itself being exactly the same still.)

Especially with larger buffer caches, that would perhaps allow the
administrator to make a call to truly eliminate synchronous writes
during normal operation, while not adversely affecting anything (if
the buffer cache is 1 GB, having a margin of say 50 MB does not really
matter much in terms of wasting memory, yet could have a significant
impact on eliminating synchronous write-outs).

On a system where you really want to keep backend writes at exactly 0
under normal circumstances (discounting vacuuming), and which has a
large buffer cache (say the one gig), it might be nice to be able to
say: ok, I have 1 GB of buffer cache; for the purpose of the JIT
algorithm, please pretend it's only 900 MB. The result is 100 MB of
constantly sized margin, with respect to ensuring writes are
asynchronous.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] lru_multiplier and backend page write-outs

2008-11-06 Thread Peter Schuller
  no table was ever large enough that 256k buffers would ever be filled by 
  the process of vacuuming a single table.
 
 Not 256K buffers--256KB, i.e. 32 buffers.

Ok. 

  In addition, when I say constantly above I mean that the count
  increases even between successive SELECT:s (of the stat table) with
  only a second or two in between.
 
 Writes to the database when only doing read operations are usually related 
 to hint bits:  http://wiki.postgresql.org/wiki/Hint_Bits

Sorry, I didn't mean to imply read-only operations (I did read the
hint bits information a while back though). What I meant was that
while I was constantly generating the insert/delete/update activity, I
was selecting the bgwriter stats with only a second or two in
between. The intent was to convey that the count of backend-written
pages was systematically and constantly (as in a few hundred per
handful of seconds) increasing, in spite of no long-running vacuum and
the buffer cache not being close to full.

  On this topic btw, was it considered to allow the administrator to
  specify a fixed-size margin to use when applying the JIT policy?
 
 Right now, there's no way to know exactly what's in the buffer cache 
 without scanning the individual buffers, which requires locking their 
 headers so you can see them consistently.  No one process can get the big 
 picture without doing something intrusive like that, and on a busy system 
 the overhead of collecting more data to know exactly how far ahead the
 cleaning is can drag down overall performance.  A lot can happen while the 
 background writer is sleeping.

Understood.
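
(For what it's worth, the kind of intrusive per-buffer scan described
above is roughly what the contrib/pg_buffercache module does, assuming
it is installed; e.g. something like

  SELECT count(*) AS buffers_in_use,
         sum(CASE WHEN isdirty THEN 1 ELSE 0 END) AS dirty_buffers
  FROM pg_buffercache
  WHERE relfilenode IS NOT NULL;

which needs to lock the buffer headers to get a consistent snapshot.)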

 One next-generation design which has been sketched out but not even 
 prototyped would take cleaned buffers and add them to the internal list of 
 buffers that are free, which right now is usually empty on the theory that 
 cached data is always more useful than a reserved buffer.  If you 
 developed a reasonable model for how many buffers you needed and padded 
 that appropriately, that's the easiest way (given the rest of the buffer 
 manager code) to get close to ensuring there aren't any backend writes. 
 Because you've got the OS buffering writes anyway in most cases, it's hard 
 to pin down whether that actually improved worst-case latency though. And 
 moving in that direction always seems to reduce average throughput even in 
 write-heavy benchmarks.

Ok.

 The important thing to remember is that the underlying OS has its own read 
 and write caching mechanisms here, and unless the PostgreSQL ones are 
 measurably better than those you might as well let the OS manage the 
 problem instead.

The problem, though, is that while the OS may be good in the common
cases it is designed for, it can have specific behaviors that are
directly counter-productive if your goals do not line up with the
commonly designed-for use case (in particular, if you care a lot about
latency and not necessarily about absolute maximum throughput).

For example, in Linux up until recently (if not still), there is a
1024-buffer per-inode limit on the number of buffers written as a
result of expiry. This means that when PostgreSQL does its fsync(), you
may end up having a lot more to write out than would have been the case
if the centisecs expiry had been enforced, regardless of whether
PostgreSQL was tuned to write dirty pages out sufficiently
aggressively. If the amount built up exceeds the capacity of the RAID
controller cache...

I had a case where I suspect this was exacerbating the
situation. Manually doing a 'sync' on the system every few seconds
noticeably helped (the theory being that it forced page write-outs
to happen earlier and in smaller storms).

  It's easy to demonstrate that's happening when you give 
 a decent amount of memory to shared_buffers, it's much harder to prove 
 that's the case for an improved write scheduling algorithm.  Stepping back 
 a bit, you might even consider that one reason PostgreSQL has grown as 
 well as it has in scalability is exactly because it's been riding 
 improvements in the underlying OS in many of these cases, rather than trying
 to do all the I/O scheduling itself.

Sure. In this case with the backend writes, I am more interested in
understanding better what is happening and having better indications
of when backends block on I/O, than necessarily having a proven
improvement in throughput or latency. It makes it easier to reason
about what is happening when you *do* have a measured performance
problem.

Thanks for all the insightful information.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





[PERFORM] lru_multiplier and backend page write-outs

2008-11-05 Thread Peter Schuller
Hello,

I've had the feeling for a while that the pg_stat_bgwriter statistics
don't work quite the way I have understood them (based on the
excellent [1] and the pg docs).

I am now monitoring a database that has an lru_multiplier of 4.0, a
delay of 200ms and a maxpages of 1000. Current stats:

postgres=# select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc 
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
               241 |              17 |              72803 |             0 |                0 |           81015 |         81708
(1 row)

This is while the database is undergoing continuous activity (almost
exclusively writing), but at a rate that does not saturate underlying
storage (caching raid controller, all write ops are fast, cache is
never filled).

In addition, PostgreSQL is not even close to filling its buffer
cache. The buffer cache is configured at 1 GB, and the resident size
of the PostgreSQL process is only 80-90 MB so far. So even
independently of any lru multiplier setting, delays and whatever else,
I don't see why any backend would ever have to do its own write-outs in
order to allocate a page from the buffer cache.

One theory: Is it the auto vacuum process? Stracing those I've seen
that they very often do writes directly to disk.

In any case, the reason I am fixating on buffers_backend is that I am
after a clear indication of whether any normal backend (non-autovacuum
or anything like that) is ever having to block on disk writes, other
than WAL fsync:s.
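
For reference, the share of writes done by backends can be computed
straight from the same view (column names as in the 8.3
pg_stat_bgwriter view); something like:

  SELECT buffers_backend, buffers_clean, buffers_checkpoint,
         round(100.0 * buffers_backend /
               nullif(buffers_backend + buffers_clean + buffers_checkpoint, 0), 1)
           AS pct_backend_writes
  FROM pg_stat_bgwriter;

With the numbers above, essentially every non-checkpoint write is a
backend write, since buffers_clean is zero.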

Is a non-zero buffers_backend consistent with expected behavior?

[1] http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] NOW vs CURRENT_DATE

2008-08-24 Thread Peter Schuller
 I just discovered a big (not only big: huge) difference between NOW() and
 CURRENT_DATE.
 
 Did you already know about it and do you know why?
 
 DELETE FROM blacklist where bl_date < (NOW() - interval '2 DAY');
 on 6 000 000 of records
 699 ms
 
 DELETE FROM blacklist where bl_date < (CURRENT_DATE - interval '2 DAY');
 on 6 000 000 of records

Is this a one-off run of one after the other (e.g. with a ROLLBACK in
between)? If so I suspect the difference is due to caching, and if you
re-run the NOW() version it would also be fast.

Also, NOW() is equivalent to CURRENT_TIMESTAMP rather than
CURRENT_DATE. Perhaps the date vs. timestamp difference has some
implication for how the query is planned.
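
A quick way to test both theories (sketch only; wrapping the deletes in
transactions keeps the data intact between runs):

  BEGIN;
  EXPLAIN ANALYZE DELETE FROM blacklist
    WHERE bl_date < (NOW() - interval '2 DAY');
  ROLLBACK;

  BEGIN;
  EXPLAIN ANALYZE DELETE FROM blacklist
    WHERE bl_date < (CURRENT_DATE - interval '2 DAY');
  ROLLBACK;

If the second execution of either variant is fast, caching is the
likely explanation; if the two plans differ, that points at the date
vs. timestamp handling instead.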

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] Large number of tables slow insert

2008-08-24 Thread Peter Schuller
 I use PostgreSQL 8.3.1-1 to store a lot of data coming from a large number
 of sensors. In order to have good performance when querying by timestamp on
 each sensor, I partitioned my measures table per sensor. Thus I created
 a lot of tables.
 I simulated a large sensor network with 3000 nodes so I have ~3000 tables.
 And it appears that each insert (in separate transactions) in the database
 takes about 300ms (3-4 inserts per second) in tables where there are just a few
 tuples (< 10). I think you can understand that it's not efficient at all
 because I need to handle a lot of inserts.
 
 Do you have any idea why it is that slow? And how can I get good insert performance?

 My test machine: Intel p4 2.4ghz, 512mb ram, Ubuntu Server (ext3)
 iostat tells me that I do: 0.5MB/s reading and ~6-7MB/s writing during
 constant inserting

Have you checked what you are bottlenecking on - CPU or disk? Try
iostat/top/etc during the inserts. Also check actual disk utilization
(iostat -x on linux/freebsd; varies on others) to see what percentage
of time the disk/storage device is busy.

You say you have 3-4 inserts/second causing 6-7 MB/s writing. That
suggests to me the inserts are fairly large. Are they in the MB range,
which would account for the I/O?

My suspicion is that you are bottlenecking on CPU, since in my
experience there is definitely something surprisingly slow about
encoding/decoding data at the protocol level, or somewhere else that is
involved in backend/client communication. I.e., I suspect your client
and/or server is spending a lot of CPU time on things not directly
related to the actual table inserts. If so, various suggested schemes
w.r.t. indexing, table bloat etc. won't help at all.

In short, 6-7 MB/second would be fairly consistent with INSERT/COPY
operations being CPU bound on a modern CPU, in my experience. It may
be that this is entirely untrue in your case, but it sounds like a
reasonable thing to at least consider.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





[PERFORM] Identifying the nature of blocking I/O

2008-08-22 Thread Peter Schuller
[for the purpose of this post, 'blocking' refers to an I/O operation
taking a long time for reasons other than the amount of work the I/O
operation itself actually implies; not to use of blocking I/O calls or
anything like that]

Hello,

I have a situation in which deterministic latency is a lot more
important than throughput.

I realize this is a hugely complex topic and that there is interaction
between many different things (pg buffer cache, OS buffer cache, RAID
controller caching, WAL buffers, storage layout, etc). I already know
several things I definitely want to do to improve things.

But in general, it would be very interesting to see, at any given
moment, what PostgreSQL backends are actually blocking on from the
perspective of PostgreSQL.

So for example, if I have 30 COMMIT:s that are active, I would like to
know whether they are simply waiting on the WAL fsync or actually
waiting on a data fsync because a checkpoint is being created. Or
similarly, for non-commits, whether they are blocking because the WAL
buffers are full and writing them out is blocking, etc.

This would make it easier to observe and draw conclusions when
tweaking different things in pg/the os/the raid controller.

Is there currently a way of dumping such information? I.e., asking PG
"what are backends waiting on right now?"
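
(The closest thing I am aware of only covers lock waits, not I/O - the
waiting flag in pg_stat_activity, e.g.

  SELECT procpid, waiting, current_query
  FROM pg_stat_activity
  WHERE waiting;

which says nothing about whether a backend is sitting in an fsync or a
WAL write.)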

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





[PERFORM] VACUUM ANALYZE blocking both reads and writes to a table

2008-06-30 Thread Peter Schuller
Hello,

my understanding, and generally my experience, has been that VACUUM
and VACUUM ANALYZE (but not VACUUM FULL) are never supposed to block
either SELECT:s or UPDATE:s/INSERT:s/DELETE:s to a table.

This is seemingly confirmed by reading the explicit locking
documentation, in terms of the locks acquired by various forms of
vacuuming, and with which other lock modes they conflict.

I have now seen it happen twice that a VACUUM ANALYZE has seemingly
been the triggering factor in blocking queries.

In the first instance, we had two particularly interesting things
going on:

  VACUUM ANALYZE thetable
  LOCK TABLE thetable IN ACCESS SHARE MODE

In addition there was one SELECT from the table, and a bunch of
INSERT:s (this is based on pg_stat_activity).

While I am unsure of why there is an explicit LOCK going on with
ACCESS SHARE MODE (no explicit locking is ever done on this table by
the application), it is supposed to be the locking used for selects. I
suspect it may be a referential integrity related acquisition
generated by PG.

The second time it happened, there was again a single SELECT, a bunch
of INSERT:s, and then:

  VACUUM ANALYZE thetable

This time there was no explicit LOCK visible.

In both cases, activity was completely blocked until the VACUUM
ANALYZE completed.

Does anyone have input on why this could be happening? The PostgreSQL
version is 8.2.4[1]. Am I correct in that it *should* not be possible
for this to happen?

For the next time this happens I will try to have a query prepared
that will dump as much relevant information as possible regarding
acquired locks. 
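
Presumably something along these lines (a sketch only, assuming the
8.2-era catalog column names):

  SELECT l.pid, l.mode, l.granted, a.current_query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.procpid = l.pid
  WHERE l.relation = 'thetable'::regclass
  ORDER BY l.granted DESC, l.pid;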

If it makes a difference, the SELECT does have a subselect that also
selects from the same table - a MAX(column) on an indexed column.

[1] I did check the ChangeLog for 8.2.x releases above .4, and the 8.3
releases, but did not see anything that indicated locking/conflict
related fixes in relation to vacuums.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] VACUUM ANALYZE blocking both reads and writes to a table

2008-06-30 Thread Peter Schuller
Hello,

 No.  VACUUM takes an exclusive lock at the end of the operation to
 truncate empty pages.  (If it cannot get the lock then it'll just skip
 this step.)  In 8.2.4 there was a bug that caused it to sleep
 according to vacuum_delay during the scan to identify possibly empty
 pages.  This was fixed in 8.2.5:

[snip revision log]

Thank you very much! This does indeed seem to be the likely
culprit. Will try to either upgrade, or if not possible in time for
the next occurrence, confirm that this is what is happening based on
pg_locks.

Thanks again for the very informative response.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] VACUUM ANALYZE blocking both reads and writes to a table

2008-06-30 Thread Peter Schuller
Actually, while on the topic:

 date: 2007-09-10 13:58:50 -0400;  author: alvherre;  state: Exp;  lines: 
 +6 -2;
 Remove the vacuum_delay_point call in count_nondeletable_pages, because 
 we hold
 an exclusive lock on the table at this point, which we want to release as 
 soon
 as possible.  This is called in the phase of lazy vacuum where we 
 truncate the
 empty pages at the end of the table.

Even with the fix the lock is held. Is the operation expected to be
fast (for some definition of fast) and in-memory, or is this
something that causes significant disk I/O and/or scales badly with
table size or similar?

I.e., is this enough that, even without the .4 bug, one should not
really consider VACUUM ANALYZE non-blocking with respect to other
transactions?

(I realize various exclusive locks are taken for short periods of time
even for things that are officially declared non-blocking; the
question is whether this falls into that category.)

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] shared_buffers in 8.3 w/ lots of RAM on dedicated PG machine

2008-02-17 Thread Peter Schuller
 PostgreSQL only uses direct I/O for writing to the WAL; everything else
 goes through the regular OS buffer cache unless you force it to do
 otherwise at the OS level (like some Solaris setups do with
 forcedirectio).  This is one reason it still may not make sense to give
 an extremely high percentage of RAM to PostgreSQL even with improvements
 in managing it.  

Ok - thank you for the input (that goes for everyone).

 Another is that shared_buffers memory has to be 
 reconciled with disk at every checkpoint, where OS buffers do not.

Hmm. Am I interpreting that correctly in that dirty buffers need to be flushed 
to disk at checkpoints? That makes perfect sense - but why would that not be 
the case with OS buffers? My understanding is that the point of the 
checkpoint is to essentially obsolete old WAL data in order to recycle the 
space, which would require flushing the data in question first (i.e.,  
normally you just fsync the WAL, but when you want to recycle space you need 
fsync() for the barrier and are then free to nuke the old WAL).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





[PERFORM] shared_buffers in 8.3 w/ lots of RAM on dedicated PG machine

2008-02-15 Thread Peter Schuller
Hello,

my impression has been that in the past, there has been a general
semi-consensus that upping shared_buffers to use the majority of RAM
has not generally been recommended, with reliance on the OS buffer
cache instead being the recommendation.

Given the changes that have gone into 8.3, in particular with regards
to minimizing the impact of large sequential scans, would it be
correct to say that given that

  - enough memory is left for other PG bits (sort mems and whatnot else)
  - only PG is running on the machine
  - you're on 64 bit so do not run into address space issues
  - the database working set is larger than RAM

it would generally be advisable to pump up shared_buffers pretty much
as far as possible instead of relying on the OS buffer cache?

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] shared_buffers in 8.3 w/ lots of RAM on dedicated PG machine

2008-02-15 Thread Peter Schuller
 PostgreSQL still depends on the OS for file access and caching. I
 think that the current recommendation is to have up to 25% of your
 RAM in the shared buffer cache.

This feels strange. Given a reasonable amount of RAM (let's say 8 GB
in this case), I cannot imagine why 75% of that would be efficiently
used for anything but the buffer cache (ignoring work_mem, stacks,
etc). Obviously the OS will need memory to do its usual stuff
(buffering to do efficient I/O, and so on). But the need for that
should not increase with the amount of RAM in the machine, all else
being equal.

What type of file I/O, other than reading pages of PostgreSQL data
which are eligible for the PostgreSQL buffer cache, does PostgreSQL do
that would take advantage of the operating system caching so much
data?

(Assuming the database is not extreme to the point of file system
metadata being huge.)

If the 25% rule still holds true, even under circumstances where the
assumption is that the PostgreSQL buffer cache is more efficient (in
terms of hit ratio) at caching PostgreSQL database data pages, it
would be useful to understand why in order to understand the
trade-offs involved and make appropriate decisions.

Or is it a matter of PostgreSQL doing non-direct I/O, such that
anything cached in shared_buffers will also be cached by the OS?

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





[PERFORM] Non-blocking vacuum full

2007-09-28 Thread Peter Schuller
Hello,

I was wondering whether any thought has previously been given to
having a non-blocking vacuum full, in the sense of space reclamation
and table compaction.

The motivation is that it is useful to be able to assume that
operations that span a table will *roughly* scale linearly with the
size of the table. But when you have a table that over an extended
period of time begins small, grows large, and grows small again (where
"large" might be, say, 200 GB), that assumption is most definitely
not correct when you're on the downward slope of that graph. Having
this assumption remain true simplifies things a lot for certain
workloads (= my particular workload ;)).

I have only looked very, very briefly at the PG code so I don't know
how far-fetched it is, but my thought was that it should be possible
to have a slow background process (similar to normal non-full vacuums
now) that would, instead of registering dead tuples in the FSM, move
live tuples around.

Combine that slow-moving operation with a new tuple space allocation
policy that prefers earlier locations on-disk, and it should in time
result in a situation where the physical on-disk file contains only
dead tuples beyond a certain percentage of the file. At that point
the file can be truncated, giving space back to the OS as well as
eliminating all that dead space from having to be covered by
sequential scans on the table.

This does of course increase the total cost of all updates and
deletes, but would be very useful in some scenarios. It also has the
interesting property that the scan for live tuples to move need not
touch the entire table to be effective; it could by design be applied
to the last n percent of the table, where n would be scaled
appropriately with the frequency of the checks relative to
update/insert frequency.

Other benefits:

  * Never vacuum full - EVER. Not even after discovering a too-small
max_fsm_pages or too-infrequent vacuums and needing to retroactively
shrink the table.
  * Increased locality in general, even if one does not care about
the disk space or sequential scanning. Particularly relevant for
low-update-frequency tables suffering from sudden shrinkage, where a
blocking VACUUM FULL is not acceptable.
  * Non-blocking CLUSTER is perhaps suddenly more trivial to implement?
Or at least SORTOFCLUSTER when you want it for reasons other than
perfect order (mostly sorted).

Opinions/thoughts?

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] max_fsm_pages, shared_buffers and checkpoint_segments

2007-05-23 Thread Peter Schuller
 increasing checkpoint_segments, which is also a disk thing. However, setting
 it to 25, and then increasing any of the other 2 variables, the postgresql
 daemon stops working, meaning it does not start upon reboot. When I bring

Sounds like you need to increase your shared memory limits.
Unfortunately this will require a reboot on FreeBSD :(

See:

   http://www.postgresql.org/docs/8.2/static/kernel-resources.html

Last time I checked, PostgreSQL should be complaining about the shared
memory on startup rather than silently failing, though. Check your logs
perhaps. Though I believe the RC script will cause the message to be
printed interactively at the console too, if you run it. (Assuming you
are using it installed from ports.)

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org






Re: [PERFORM] Postgres Benchmark Results

2007-05-22 Thread Peter Schuller
 - Deferred Transactions: since adding a comment to a blog post
 doesn't need the same guarantees as submitting a paid order, it makes
 sense that the application could tell postgres which transactions we
 care about if power is lost. This will massively boost performance for
 websites I believe.

This would be massively useful. Very often all I care about is that the
transaction is semantically committed; that is, that other transactions
starting from that moment will see the modifications done. As opposed to
actually persisting data to disk.

In particular I have a situation where I attempt to utilize available
hardware by using concurrency. The problem is that I have to either
hugely complicate my client code or COMMIT more often than I would like
in order to satisfy dependencies between different transactions. If a
deferred/delayed commit were possible I could get all the performance
benefit without the code complexity, and with no penalty (because in
this case persistence is not important).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org






Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

2007-04-04 Thread Peter Schuller
Hello,

 The next question then is whether anything in your postgres configuration
 is preventing it getting useful performance from the OS. What settings
 have you changed in postgresql.conf?

The only options not commented out are the following (it's not even
tweaked for buffer sizes and such, since in this case I am not
interested in things like sort performance and cache locality other
than as an afterthought):

hba_file = '/etc/postgresql/8.1/main/pg_hba.conf'
ident_file = '/etc/postgresql/8.1/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/8.1-main.pid'
listen_addresses = '*'
port = 5432
max_connections = 100
unix_socket_directory = '/var/run/postgresql'
ssl = true
shared_buffers = 1000
log_line_prefix = '%t '
stats_command_string = on
stats_row_level = on
autovacuum = on
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

 Are you using any unusual settings within the OS itself?

No. It's a pretty standard kernel. The only local tweaking done is
enabling/disabling various things; there are no special patches used
or attempts to create a minimalistic kernel or anything like that.

 You're forgetting the LIMIT clause. For the straight index scan, the
 query aborts when the LIMIT is reached having scanned only the specified
 number of index rows (plus any index entries that turned out to be dead
 in the heap). For the bitmap scan case, the limit can be applied only after
 the heap scan is under way, therefore the index scan to build the bitmap
 will need to scan ~50k rows, not the 10k specified in the limit, so the
 amount of time spent scanning the index is 50 times larger than in the
 straight index scan case.

Ok - makes sense that it has to scan the entire subset of the index
for the value in question. I will have to tweak the CPU/disk cost
settings (which I have, on purpose, not yet done).
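
(By that I mean the usual planner knobs, experimented with per-session;
the values below are purely illustrative, not recommendations:

  SET random_page_cost = 2.0;
  SET cpu_tuple_cost = 0.01;
  SET cpu_index_tuple_cost = 0.005;

rather than changing anything globally yet.)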

 However, I do suspect you have a problem here somewhere, because in my
 tests the time taken to do the bitmap index scan on 50k rows, with the
 index in cache, is on the order of 30ms (where the data is cached in
 shared_buffers) to 60ms (where the data is cached by the OS). That's on
 a 2.8GHz xeon.

This is on a machine with 2.33GHz Xeons and I wasn't trying to
exaggerate. I timed it and it is CPU bound (in userspace; next to no
system CPU usage at all) for about 15 seconds for the case of
selecting with a limit of 10000.

Given that there is no disk activity I can't imagine any buffer sizes
or such affecting this other than userspace vs. kernelspace CPU
concerns (since obviously the data being worked on is in RAM). Or am I
missing something?

It is worth noting that the SELECT of fewer entries is entirely disk
bound; there is almost no CPU usage whatsoever. Even taking the
cumulative CPU usage into account (gut feeling calculation, nothing
scientific) and multiplying by 50 you are nowhere near 15 seconds of
CPU boundness. So it is indeed strange.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

2007-04-04 Thread Peter Schuller
Hello,

 I'd always do benchmarks with a realistic value of shared_buffers (i.e.
 much higher than that).
 
 Another thought that comes to mind is that the bitmap index scan does
 depend on the size of work_mem.
 
 Try increasing your shared_buffers to a reasonable working value (say
 10%-15% of RAM - I was testing on a machine with 4GB of RAM, using a
 shared_buffers setting of 5), and increase work_mem to 16364, and
 see if there are any noticable changes in behaviour.

Increasing the buffer size and work_mem did have a significant
effect. I can understand it in the case of the heap scan, but I am
still surprised at the index scan. Could pg be serializing the entire
query as a result of insufficient buffers/work_mem to satisfy multiple
concurrent queries?

With both turned up, not only is the heap scan no longer visibly CPU
bound, I am seeing some nice scaling in terms of disk I/O. I have not
yet benchmarked to the point of being able to say whether it's
entirely linear, but it certainly seems to at least be approaching the
ballpark.

Thank you for the help! I guess I made a bad call not tweaking
this. My thinking was that I explicitly did not want to turn it up, so
that I could benchmark the raw performance of disk I/O rather than
having things be cached in memory more than they already would be. But
apparently it had other side-effects I did not consider.

Thanks again,

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

2007-04-02 Thread Peter Schuller
Hello,

 If you are dealing with timed data or similar, you may consider to
 partition your table(s).

Unfortunately this is not the case; the insertion is more or less
random (not quite, but for the purpose of this problem it is).

Thanks for the pointers though. That is sure to be useful in some
other context down the road.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe

2007-04-02 Thread Peter Schuller
Hello,

  SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;
 
 I tested this on a 14-way software raid10 on freebsd, using pg 8.1.6, and
 couldn't reproduce anything like it. With one client I get about 200 disk
 requests per second, scaling almost exactly linearly for the first 5 or so
 clients, as expected. At 14 clients it was down to about 150 reqs/sec per
 client, but the total throughput continued to increase with additional
 concurrency up to about 60 clients, giving about 3600 reqs/sec (260 per
 disk, which is about right for 10krpm scsi disks under highly concurrent
 random loads).

Ok. That is very interesting; so there is definitely nothing
fundamental in PG that prevents the scaling (even if on FreeBSD).

 A good question. Have you tried testing the disks directly? e.g. create
 some huge files, and run a few concurrent random readers on them? That
 would test the array and the filesystem without involving postgres.

I have confirmed that I am seeing expected performance for random,
short and highly concurrent reads in one large (> 200 GB) file. The
I/O is done using libaio however, so depending on implementation I
suppose the I/O scheduling behavior of the fs/raid driver might be
affected compared to having a number of concurrent threads doing
synchronous reads. I will try to confirm performance in a way that
will more closely match PostgreSQL's behavior.

I have to say though that I will be pretty surprised if the
performance is not matched in that test.

Is there any chance there is some operating system conditional code in
pg itself that might affect this behavior? Some kind of purposeful
serialization of I/O for example (even if that sounds like an
extremely strange thing to do)?

 This is entirely expected. With the larger row count, it is more likely
 (or so the planner estimates) that rows will need to be fetched from
 adjacent or at least nearby blocks, thus a plan which fetches rows in
 physical table order rather than index order would be expected to be
 superior. The planner takes into account the estimated startup cost and
 per-row cost when planning LIMIT queries; therefore it is no surprise
 that for larger limits, it switches to a plan with a higher startup cost
 but lower per-row cost.

Roger that, makes sense. I had misunderstood the meaning of the heap
scan.

 Most likely your index is small enough that large parts of it will be
 cached in RAM, so that the scan of the index to build the bitmap does
 not need to hit the disk much if at all.

Even so however, several seconds of CPU activity to scan the index for
a few tens of thousands of entries sounds a bit excessive. Or does it
not? Because at that level, the CPU bound period alone is approaching
the time it would take to seek for each entry instead. But then I
presume the amount of work is similar/the same for the other case,
except it's being done at the beginning of the query instead of before
each seek.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org





[PERFORM] Scaling SELECT:s with the number of disks on a stripe

2007-03-29 Thread Peter Schuller
Hello,

I am looking to use PostgreSQL for storing some very simple flat data
mostly in a single table. The amount of data will be in the hundreds
of gigabytes range. Each row is on the order of 100-300 bytes in size;
in other words, small enough that I am expecting disk I/O to be seek
bound (even if PostgreSQL reads a full pg page at a time, since a page
is significantly smaller than the stripe size of the volume).

The only important performance characteristics are insertion/deletion
performance, and the performance of trivial SELECT queries whose WHERE
clause tests equality on one of the columns.

Other than absolute performance, an important goal is to be able to
scale fairly linearly with the number of underlying disk drives. We
are fully willing to take a disk seek per item selected, as long as it
scales.

To this end I have been doing some benchmarking to see whether the
plan is going to be feasible. On a 12-disk hardware stripe, insertion
performance does scale somewhat with concurrent inserters. However, I
am seeing surprising effects with SELECT:s: a single selector
generates the same amount of disk activity as two concurrent selectors
(I was expecting about twice as much).

The query is simple:

SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;

No ordering, no joins, no nothing. Selecting concurrently with two
different values of 'xxx' yields the same amount of disk activity
(never any significant CPU activity). Note that the total amount of
data is too large to fit in RAM (> 500 million rows), and the number
of distinct values in the value column is 1. The column in the
WHERE clause is indexed.
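
For reference, the table is roughly of this shape - the names are the
ones appearing in the plans further down, and the payload column is
just an assumption standing in for whatever brings rows to the stated
100-300 bytes:

  CREATE TABLE test (
      val     varchar NOT NULL,  -- the indexed, equality-tested column
      payload bytea              -- filler to reach the stated row size
  );
  CREATE INDEX test_val_ix ON test (val);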

So my first question is - why am I not seeing this scaling? The
absolute amount of disk activity with a single selector is consistent
with what I would expect from a SINGLE disk, which is completely
expected since I never thought PostgreSQL would introduce disk I/O
concurrency on its own. But this means that adding additional readers
doing random-access reads *should* scale very well with 12 underlying
disks in a stripe.

(Note that I have seen fairly similar results on other RAID variants
too, including software RAID5 (yes yes I know), in addition to the
hardware stripe.)

These tests have been done on Linux 2.6.19.3 and PostgreSQL 8.1.

Secondly, I am seeing a query plan switch after a certain
threshold. Observe:

perftest=# explain select * from test where val='7433' limit 1000;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=0.00..4016.50 rows=1000 width=143)
   ->  Index Scan using test_val_ix on test  (cost=0.00..206620.88 rows=51443 width=143)
         Index Cond: ((val)::text = '7433'::text)
(3 rows)

Now increasing to a limit of 10000:

perftest=# explain select * from test where val='7433' limit 10000;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit  (cost=360.05..38393.36 rows=10000 width=143)
   ->  Bitmap Heap Scan on test  (cost=360.05..196014.82 rows=51443 width=143)
         Recheck Cond: ((val)::text = '7433'::text)
         ->  Bitmap Index Scan on test_val_ix  (cost=0.00..360.05 rows=51443 width=0)
               Index Cond: ((val)::text = '7433'::text)
(5 rows)

The interesting part is that the latter query is entirely CPU bound
(no disk I/O at all) for an extended period of time before even
beginning to read data from disk. And when it *does* start performing
disk I/O, the performance is about the same as for the other case. In
other words, the change in query plan seems to do nothing but add
overhead.

What is the bitmap heap scan supposed to be doing that would increase
performance above a "seek once per matching row" plan? I haven't been
able to Google my way to what the intended benefit of a bitmap heap
scan vs. a plain index scan is.
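
(For comparison purposes, the bitmap scan can be disabled per session
to force the plain index scan plan even above the threshold:

  SET enable_bitmapscan = off;
  EXPLAIN SELECT * FROM test WHERE val = '7433' LIMIT 10000;
  RESET enable_bitmapscan;

which at least makes it easy to time the two strategies against each
other on the same data.)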

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org


