Re: [PERFORM] 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4

2009-03-13 Thread Alan Stange

Gregory Stark wrote:

A minute ago I said:

 AFAIK Opensolaris doesn't implement posix_fadvise() so there's no benefit. It
 would be great to hear if you could catch the ear of the right people to get
 an implementation committed. Depending on how the i/o scheduler system is
 written it might not even be hard -- the Linux implementation of WILLNEED is
 all of 20 lines.

I noticed after sending it that that's slightly unfair. The 20-line function
calls another function (which calls another function) to do the real readahead
work. That function (mm/readahead.c:__do_page_cache_readahead()) is 48 lines.

  

It's implemented.   I'm guessing it's not what you want to see though:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] PostgreSQL and Ultrasparc T1

2005-12-20 Thread Alan Stange

Jignesh K. Shah wrote:

I guess it depends on what you term as your metric for measurement.
If it is just one query execution time .. It may not be the best on 
UltraSPARC T1.
But if you have more than 8 complex queries running simultaneously, 
UltraSPARC T1 can do comparatively well, provided the application can 
scale along with it.


I just want to clarify one issue here.   It's my understanding that the 
8-core system, with 4 hardware threads per core (known as strands), is 
seen as a 32-cpu system by Solaris. 

So, one could have up to 32 postgresql processes running in parallel on 
the current systems (assuming the application can scale).


-- Alan

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] PostgreSQL and Ultrasparc T1

2005-12-20 Thread Alan Stange

David Lang wrote:

On Tue, 20 Dec 2005, Alan Stange wrote:


Jignesh K. Shah wrote:

I guess it depends on what you term as your metric for measurement.
If it is just one query execution time .. It may not be the best on 
UltraSPARC T1.
But if you have more than 8 complex queries running simultaneously, 
UltraSPARC T1 can do comparatively well, provided the application can 
scale along with it.


I just want to clarify one issue here.   It's my understanding that 
the 8-core system, with 4 hardware threads per core (known as strands), 
is seen as a 32-cpu system by Solaris. So, one could have up to 32 
postgresql processes running in parallel on the current systems 
(assuming the application can scale).


note that like hyperthreading, the strands aren't full processors; 
their efficiency depends on how much the other threads sharing the core 
stall waiting for external things. 
Exactly.  




Re: [PERFORM] postgresql performance tuning

2005-12-06 Thread Alan Stange

Vivek Khera wrote:


On Dec 6, 2005, at 11:14 AM, Ameet Kini wrote:

need for vacuums. However, it'd be great if there was a similar 
automatic reindex utility, like say, a pg_autoreindex daemon. Are 
there any plans for this feature?  If not, then would cron scripts be 
the next best


what evidence do you have that you are suffering index bloat?  or are 
you just looking for solutions to problems that don't exist as an 
academic exercise? :-) 


The files for the two indices on a single table used 7.8GB of space 
before a reindex, and 4.4GB after.   The table had been reindexed over 
the weekend and a vacuum was completed on the table about 2 hours ago.


The two indices are now 3.4GB smaller.   I don't think this counts as 
bloat, because of our use case.  Even so, we reindex our whole database 
every weekend.


-- Alan




Re: [PERFORM] postgresql performance tuning

2005-12-06 Thread Alan Stange

Tom Lane wrote:

Alan Stange [EMAIL PROTECTED] writes:
  

Vivek Khera wrote:


what evidence do you have that you are suffering index bloat?
  


  
The files for the two indices on a single table used 7.8GB of space 
before a reindex, and 4.4GB after.



That's not bloat ... that's pretty nearly in line with the normal
expectation for a btree index, which is about 2/3rds fill factor.
If the compacted index were 10X smaller then I'd agree that you have
a bloat problem.
  

I wrote "I don't think this counts as bloat".  I still don't.

-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-23 Thread Alan Stange

Luke Lonergan wrote:
Why not contribute something - put up proof of your stated 8KB versus 
32KB page size improvement.


I did observe that 32KB block sizes were a significant win for our 
usage patterns.   It might be a win for any of the following reasons:


0) The preliminaries:   ~300GB database with ~50GB daily turnover.   
Our data is reasonably well grouped: if we're getting one item on a 
page, we're usually looking at the other items as well.


1) we can live with a smaller FSM size.  We were often leaking pages 
with a 10M page FSM setting.  With 32K pages, a 10M FSM size is 
sufficient.   Yes, the solution to this is to run vacuum more often, 
but when the vacuum was taking 10 hours at a time, that was hard to do.


2) The typical datum size in our largest table is about 2.8KB, which is 
more than 1/4 page size thus resulting in the use of a toast table.   
Switching to 32KB pages allows us to get a decent storage of this data 
into the main tables, thus avoiding another table and associated large 
index.   Not having the extra index in memory for a table with 90M rows 
is probably beneficial.


3) vacuum time has been substantially reduced.  Vacuum analyze now runs 
in the 2 to 3 hour range depending on load.


4) less cpu time spent in the kernel.  We're basically doing 1/4 as many 
system calls.  

Overall the system has now been working well.  We used to see the 
database being a bottleneck at times, but now it's keeping up nicely.


Hope this helps.

Happy Thanksgiving!

-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-22 Thread Alan Stange

Luke,

- XFS will probably generate better data rates with larger files.   You 
really need to use the same file size as postgresql does.  Why compare 
the speed of reading a 16G file with the speed of reading a 1G file?   
They won't be the same.  If need be, write some code that does the test 
or modify lmdd to read a sequence of 1G files.   Will this make a 
difference?  You don't know until you do it.   Any time you cross a 
couple of powers of two in computing, you should expect some differences.


- you did umount the file system before reading the 16G file back in?   
Because if you didn't then your read numbers are possibly garbage.   
When the read began, 8G of the file was in memory.   You'd be very naive 
to think that somehow the read of the first 8GB somehow flushed that 
cached data out of memory.  After all, why would the kernel flush pages 
from file X when you're in the middle of a sequential read of...file 
X?   I'm not sure how Linux handles this, but Solaris would've found the 
8G still in memory.


- What was the hardware and disk configuration on which these numbers 
were generated?   For example, if you have a U320 controller, how did 
the read rate become larger than 320MB/s?


- how did the results change from before?   Just posting the new results 
is misleading given all the boasting we've had to read about your past 
results.


- there are two results below for writing to ext2:  one at 209 MB/s and 
one at 113 MB/s.  Why are they different?


- what was the cpu usage during these tests?   We see postgresql doing 
200+MB/s of IO.   You've claimed many times that the machine would be 
compute bound at lower IO rates, so how much idle time does the cpu 
still have?


- You wrote:  "We'll do a 16GB table size to ensure that we aren't 
reading from the read cache."   Do you really believe that??   You have 
to umount the file system before each test to ensure you're really 
measuring the disk IO rate.   If I'm reading your results correctly, it 
looks like you have three results for ext and xfs, each of which is 
faster than the prior one; that suggests at least one of them is 
clearly reading from the read cache.


- Gee, it's so nice of you to drop your 120MB/s observation.  I guess my 
reading at 300MB/s wasn't convincing enough.  Yeah, I think it was the 
cpus too...


- I wouldn't focus on the flat 64% of the data rate number.  It'll 
probably be different on other systems.


I'm all for testing and testing.   It seems you still cut a corner 
without umounting the file system first.  Maybe I'm a little too old 
school on this, but I wouldn't spend a dime until you've done the 
measurements correctly. 

Good Luck. 


-- Alan



Luke Lonergan wrote:

Alan,

Looks like Postgres gets sensible scan rate scaling as the filesystem speed
increases, as shown below.  I'll drop my 120MB/s observation - perhaps CPUs
got faster since I last tested this.

The scaling looks like 64% of the I/O subsystem speed is available to the
executor - so as the I/O subsystem increases in scan rate, so does Postgres'
executor scan speed.

So that leaves the question - why not more than 64% of the I/O scan rate?
And why is it a flat 64% as the I/O subsystem increases in speed from
333-400MB/s?

- Luke
 
=== Results ===


Unless noted otherwise all results posted are for block device readahead set
to 16M using blockdev --setra=16384 block_device.  All are using the
2.6.9-11 Centos 4.1 kernel.

For those who don't have lmdd, here is a comparison of two results on an
ext2 filesystem:


[EMAIL PROTECTED] dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile
bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real0m33.057s
user0m0.116s
sys 0m13.577s

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real0m33.032s
user0m0.087s
sys 0m13.129s


So lmdd with sync=1 is equivalent to a sync after a dd.

I use 2x memory with dd for the *READ* performance testing, but let's make
sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus ext3, data=ordered, versus xfs:


16GB write, then read

---
ext2:
---
[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k
count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k
count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

---
ext3, data=ordered:

Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-22 Thread Alan Stange

Bruce Momjian wrote:

Greg Stark wrote:
  

Alan Stange [EMAIL PROTECTED] writes:



The point you're making doesn't match my experience with *any* storage or program
I've ever used, including postgresql.   Your point suggests that the storage
system is idle and that postgresql is broken because it isn't able to use the
resources available...even when the cpu is very idle.  How can that make sense?
  

Well I think what he's saying is that Postgres is issuing a read, then waiting
for the data to return. Then it does some processing, and goes back to issue
another read. The CPU is idle half the time because Postgres isn't capable of
doing any work while waiting for i/o, and the i/o system is idle half the time
while the CPU intensive part happens.

(Consider as a pathological example a program that reads 8k then sleeps for
10ms, and loops doing that 1,000 times. Now consider the same program
optimized to read 8M asynchronously and sleep for 10s. By the time it's
finished sleeping it has probably read in all 8M. Whereas the program that
read 8k in little chunks interleaved with small sleeps would probably take
twice as long and appear to be entirely i/o-bound with 50% iowait and 50%
idle.)

It's a reasonable theory and it's not inconsistent with the results you sent.
But it's not exactly proven either. Nor is it clear how to improve matters.
Adding additional threads to handle the i/o adds an enormous amount of
complexity and creates lots of opportunity for other contention that could
easily eat all of the gains.



Perfect summary.  We have a background writer now.  Ideally we would
have a background reader that reads ahead blocks into the buffer cache.
The problem is that while there is a relatively long time between a
buffer being dirtied and the time it must be on disk (checkpoint time),
the read-ahead time is much shorter, requiring some kind of quick
"create a thread" approach that could easily bog us down as outlined
above.

Right now the file system will do read-ahead for a heap scan (but not an
index scan), but even then, there is time required to get that kernel
block into the PostgreSQL shared buffers, backing up Luke's observation
of heavy memcpy() usage.

So what are our options?  mmap()?  I have no idea.  Seems larger page
size does help.
For sequential scans, you do have a background reader.  It's the 
kernel.  As long as you don't issue a seek() between read() calls, the 
kernel will get the hint about sequential IO and begin to perform a read 
ahead for you.  This is where the above analysis isn't quite right:  
while postgresql is processing the returned data from the read() call, 
the kernel has also issued reads as part of the read ahead, keeping the 
device busy while the cpu is busy.  (I'm assuming these details for 
Linux; Solaris/UFS does work this way).  Issue one seek on the file and 
the read ahead algorithm will back off for a while.   This was my point 
about some descriptions of how the system works not being sensible.


If your goal is sequential IO, then one must use larger block sizes.   
No one would use 8KB IO for achieving high sequential IO rates.   Simply 
put, read() is about the slowest way to get 8KB of data. Switching 
to 32KB blocks reduces all the system call overhead by a large margin.  
Larger blocks would be better still, up to the stripe size of your 
mirror.   (Of course, you're using a mirror and not raid5 if you care 
about performance.)


I don't think the memcpy of data from the kernel to userspace is that 
big of an issue right now.  dd and all the high end network interfaces 
manage OK doing it, so I'd expect postgresql to do all right with it 
too.   Direct IO will avoid that memcpy, but then you also don't get 
any caching of the files in memory.  I'd be more concerned about any 
memcpy calls or general data management within postgresql.    Does 
postgresql use the platform specific memcpy() in libc?  Some care might 
be needed to ensure that the memory blocks within postgresql are all 
properly aligned to make sure that one isn't ping-ponging cache lines 
around (usually done by padding the buffer sizes by an extra 32 bytes or 
L1 line size).   Whatever you do, all the usual high performance 
computing tricks should be used prior to considering any rewriting of 
major code sections.


Personally, I'd like to see some detailed profiling being done using 
hardware counters for cpu cycles and cache misses, etc.   Given the poor 
quality of work that has been discussed here in this thread, I don't 
have much confidence in any other additional results at this time.   
None of the analysis would be acceptable in any environment in which 
I've worked.   Be sure to take a look at Sun's free Workshop tools as 
they are excellent for this sort of profiling and one doesn't need to 
recompile to use them.If I get a little time in the next week or two 
I might take a crack at this.


Cheers,

-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Alan Stange

Luke Lonergan wrote:

OK - slower this time:
We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU
machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but
which all are capped at 120MB/s when doing sequential scans with different
versions of Postgres.
  
Postgresql issues the exact same sequence of read() calls as does dd.   
So why is dd so much faster?


I'd be careful with the dd read of a 16GB file on an 8GB system.  Make 
sure you umount the file system first, to make sure all of the file is 
flushed from memory.   Some systems use a freebehind on sequential reads 
to avoid flushing memory...and you'd find that 1/2 of your 16GB file is 
still in memory.   The same point also holds for the writes:  when dd 
finishes not all the data is on disk.   You need to issue a sync() call 
to make that happen.  Use lmdd to ensure that the data is actually all 
written.   In other words, I think your dd results are possibly misleading.


It's trivial to demonstrate:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real0m13.780s
user0m0.134s
sys 0m13.510s

Oops.   I just wrote 470MB/s to a file system that has a peak write 
speed of 200MB/s.


Now, you might say that you wrote a 16GB file on an 8 GB machine so this 
isn't an issue.   It does make your dd numbers look fast as some of the 
data will be unwritten.



I'd also suggest running dd on the same files as postgresql.  I suspect 
you'd find that the layout of the postgresql files isn't that good as 
they are grown bit by bit, unlike the file created by simply dd'ing a 
large file.



Understand my point: It doesn't matter that there is idle or iowait on the
CPU, the postgres executor is not able to drive the I/O rate for two
reasons: there is a lot of CPU used for the scan (the 40% you reported) and
a lack of asynchrony (the iowait time).  That means that by speeding up the
CPU you only reduce the first part, but you don't fix the second and v.v.

With more aggressive readahead, the second problem (the I/O asynchrony) is
handled better by the Linux kernel and filesystem.  That's what we're seeing
with XFS.


I think your point doesn't hold up.  Every time you make it, I come away 
posting another result showing it to be incorrect.


The point you're making doesn't match my experience with *any* storage or 
program I've ever used, including postgresql.   Your point suggests that 
the storage system is idle  and that postgresql is broken because it 
isn't able to use the resources available...even when the cpu is very 
idle.  How can that make sense?   The issue here is that the storage 
system is very active doing reads on the files...which might be somewhat 
poorly allocated on disk because postgresql grows the tables bit by bit.


I had the same readahead in Reiser and in XFS.   The XFS performance was 
better because XFS does a better job of large file allocation on disk, 
thus resulting in many fewer seeks (generated by the file system itself) 
to read the files back in.   As an example, some file systems like UFS 
purposely scatter large files across cylinder groups to avoid forcing 
large seeks on small files; one can tune this behavior so that large 
files are more tightly allocated.




Of course, because this is engineering, I have another obligatory data 
point:   This time it's a 4.2GB table using 137,138 32KB pages with 
nearly 41 million rows.


A select count(1) on the table completes in 14.6 seconds, for an 
average read rate of 320 MB/s. 

One cpu was idle, the other averaged 32% system time and 68% user time 
for the 14 second period.   This is on a 2.2Ghz Opteron.   A faster cpu 
would show increased performance, as I really am cpu bound finally. 

Postgresql is clearly able to issue the relevant sequential read() 
system calls and sink the resulting data without a problem if the file 
system is capable of providing the data.  It can do this up to a speed 
of ~300MB/s on this class of system.   Now it should be fairly simple to 
tweak the few spots where some excess memory copies are being done and 
up this result substantially.  I hope postgresql is always using the 
libc memcpy as that's going to be a lot faster than some private routine.


-- Alan





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-21 Thread Alan Stange

Luke,

it's time to back yourself up with some numbers.   You're claiming the 
need for a significant rewrite of portions of postgresql and you haven't 
done the work to make that case. 

You've apparently made some mistakes on the use of dd to benchmark a 
storage system.   Use lmdd and umount the file system before the read 
and post your results.  Using a file 2x the size of memory doesn't work 
correctly.  You can quote any other numbers you want, but until you use 
lmdd correctly you should be ignored.  Ideally, since postgresql uses 
1GB files, you'll want to use 1GB files for dd as well.


Luke Lonergan wrote:

Alan,

On 11/21/05 6:57 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real0m13.780s
user0m0.134s
sys 0m13.510s

Oops.   I just wrote 470MB/s to a file system that has a peak write speed
of 200MB/s.


How much RAM on this machine?
  
Doesn't matter.  The result will always be wrong without a call to 
sync() or fsync() before the close() if you're trying to measure the 
speed of the disk subsystem.   Add that sync() and the result will be 
correct for any memory size.  Just for completeness:  Solaris implicitly 
calls sync() as part of close.   Bonnie used to get this wrong, so 
quoting Bonnie isn't any good.   Note that on some systems using 2x 
memory for these tests is almost OK.  For example, Solaris used to have 
a hiwater mark that would throttle processes and not allow more than a 
few 100KB of writes to be outstanding on a file.  Linux/XFS clearly 
allows a lot of write data to be outstanding.  It's best to understand 
the tools and know what they do and why they can be wrong than simply 
quoting some other tool that makes the same mistakes.


I find that postgresql is able to achieve about 175MB/s on average from 
a system capable of delivering 200MB/s peak and it does this with a lot 
of cpu time to spare.   Maybe dd can do a little better and deliver 
185MB/s.If I were to double the speed of my IO system, I might find 
that a single postgresql instance can sink about 300MB/s of data (based 
on the last numbers I posted).  That's why I have multi-cpu opterons and 
more than one query/client as they soak up the remaining IO capacity.


It is guaranteed that postgresql will hit some threshold of performance 
in the future and possible rewrites of some core functionality will be 
needed, but no numbers posted here so far have made the case that 
postgresql is in trouble now.    In the meantime, build balanced 
systems with cpus that match the capabilities of the storage subsystems, 
use 32KB block sizes for large memory databases that are doing lots of 
sequential scans, use file systems tuned for large files, use opterons, etc.



As always, one has to post some numbers.   Here's an example of how dd 
doesn't do what you might expect:


mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k
16.7772 MB in 0.0235 secs, 714.5931 MB/sec

mite02:~ # lmdd  if=internal of=/fidb2/bigfile bs=8k count=2k sync=1
16.7772 MB in 0.1410 secs, 118.9696 MB/sec

Both numbers are correct.  But one measures the kernel's ability to 
absorb 2000 8KB writes with no guarantee that the data is on disk, and 
the second measures the disk subsystem's ability to write 16MB of data.  
dd is equivalent to the first result.  You can't use the first type of 
result and complain that postgresql is slow.  If you wrote 16G of data 
on a machine with 8G memory then your dd result is possibly too fast by 
a factor of two as 8G of the data might not be on disk yet.  We won't 
know until you post some results.


Cheers,

-- Alan




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Alan Stange

William Yu wrote:

Alan Stange wrote:

Luke Lonergan wrote:

The aka iowait is the problem here - iowait is not idle (otherwise it
would be in the idle column).

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as 


iowait time is idle time. Period.   This point has been debated 
endlessly for Solaris and other OS's as well.


I'm sure the the theory is nice but here's my experience with iowait 
just a minute ago. I run Linux/XFce as my desktop -- decided I wanted 
to lookup some stuff in Wikipedia under Mozilla and my computer system 
became completely unusable for nearly a minute while who knows what 
Mozilla was doing. (Probably loading all the language packs.) I could 
not even switch to IRC (already loaded) to chat with other people 
while Mozilla was chewing up all my disk I/O.


So I went to another computer, connected to mine remotely (slow...) 
and checked top. 90% in the wa column which I assume is the iowait 
column. It may be idle in theory but it's not a very useful idle -- 
wasn't able to switch to any programs already running, couldn't click 
on the XFce launchbar to run any new programs.


So, you have a sucky computer.I'm sorry, but iowait is still idle 
time, whether you believe it or not.


-- Alan




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-20 Thread Alan Stange

Greg Stark wrote:

Alan Stange [EMAIL PROTECTED] writes:

  

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as predicted,
one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
be context switched to other users, but as I pointed out earlier, that's not
useful for getting response on decision support queries.
  


I don't think that's true. If the syscall was preemptable then it wouldn't
show up under iowait, but rather idle. The time spent in iowait is time in
uninterruptible sleeps where no other process can be scheduled.
  
That would be wrong.   The time spent in iowait is idle time.  The 
iowait stat would be 0 on a machine with a compute bound runnable 
process available for each cpu.


Come on people, read the man page or look at the source code.   Just 
stop making stuff up.



  

iowait time is idle time. Period.   This point has been debated endlessly for
Solaris and other OS's as well.

Here's the man page:
  %iowait
 Show  the  percentage  of  time that the CPU or CPUs were
 idle during which the system had an outstanding disk  I/O
 request.

If the system had some other cpu bound work to perform you wouldn't ever see
any iowait time.  Anyone claiming the cpu was 100% busy on the sequential scan
using the one set of numbers I posted is misunderstanding the actual metrics.



That's easy to test. rerun the test with another process running a simple C
program like main() {while(1);} (or two invocations of that on your system
because of the extra processor). I bet you'll see about half the percentage of
iowait because postres will get half as much opportunity to schedule i/o. If
what you are saying were true then you should get 0% iowait.
Yes, I did this once about 10 years ago.   But instead of saying "I bet" 
and guessing at the result, you should try it yourself.    Without 
guessing, I can tell you that the iowait time will go to 0%.  You can do 
this loop in the shell, so there's no code to write.  Also, it helps to 
do this with the shell running at a lower priority.


-- Alan





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-19 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 11:39 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy
and 60% idle (aka iowait in the above numbers).



The aka iowait is the problem here - iowait is not idle (otherwise it
would be in the idle column).

Iowait is time spent waiting on blocking io calls.  As another poster
pointed out, you have a two CPU system, and during your scan, as predicted,
one CPU went 100% busy on the seq scan.  During iowait periods, the CPU can
be context switched to other users, but as I pointed out earlier, that's not
useful for getting response on decision support queries.
  
iowait time is idle time. Period.   This point has been debated 
endlessly for Solaris and other OS's as well.


Here's the man page:
  %iowait
         Show the percentage of time that the CPU or CPUs were
         idle during which the system had an outstanding disk I/O
         request.

If the system had some other cpu bound work to perform you wouldn't ever 
see any iowait time.  Anyone claiming the cpu was 100% busy on the 
sequential scan using the one set of numbers I posted is 
misunderstanding the actual metrics.



Thanks for your data, it exemplifies many of the points brought up:
- Lots of disks and expensive I/O hardware does not help improve performance
on large table queries because I/O bandwidth does not scale beyond
110-120MB/s on the fastest CPUs
  
I don't think that is the conclusion from the anecdotal numbers I posted.   
This file subsystem doesn't perform as well as expected for any tool.   
Bonnie, dd, star, etc., don't get a better data rate either.   In fact, 
the storage system wasn't built for performance; it was built to 
reliably hold a big chunk of data.   Even so, postgresql is reading at 
130MB/s on it, using about 30% of a single cpu, almost all of which was 
system time.   I would get the same 130MB/s on a system with cpus that 
were substantially slower; the limitation isn't the cpus, or 
postgresql.  It's the IO system that is poorly configured for this test, 
not postgresql's ability to use it.


In fact, given the numbers I posted, it's clear this system could 
handily generate more than 120 MB/s using a single cpu given a better IO 
subsystem;  it has cpu time to spare.   A simple test can be done:   
build the database in /dev/shm and time the scans.  It's the same read() 
system call being used and now one has made the IO system infinitely 
fast.  The claim is being made that standard postgresql is unable to 
generate more than 120MB/s of IO on any IO system due to an inefficient 
use of the kernel API and excessive memory copies, etc.  Having the 
database be on a ram based file system is an example of expensive IO 
hardware and all else would be the same.   Hmmm, now that I think about 
this,  I could throw a medium sized table onto /dev/shm using 
tablespaces on one of our 8GB linux boxes.So why is this experiment 
not valid, or what is it about the above assertion that I am missing?



Anyway, if one cares about high speed sequential IO, then one should use 
a much larger block size to start.   Using 8KB IOs is inappropriate for 
such a configuration.  We happen to be using 32KB blocks on our largest 
database and it's been the best move for us.


-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-19 Thread Alan Stange
Another data point. 

We had some down time on our system today to complete some maintenance 
work.  I took the opportunity to rebuild the 700GB file system using 
XFS instead of Reiser.


One iostat output for 30 seconds is

avg-cpu:  %user   %nice%sys %iowait   %idle
  1.580.00   19.69   31.94   46.78

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdd 343.73175035.73   277.555251072   8326

while doing a select count(1) on the same large table as before.   
Subsequent iostat output all showed that this data rate was being 
maintained.  The system is otherwise mostly idle during this measurement.


The sequential read rate is 175MB/s.  The system is the same as earlier, 
one cpu is idle and the second is ~40% busy doing the scan and ~60% 
idle.   This is  postgresql 8.1rc1, 32KB block size.  No tuning except 
for using a 1024KB read ahead.


The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel 
controller).  I see no reason why this configuration wouldn't generate 
higher IO rates if a faster IO connection were available.


Can you explain again why you think there's an IO ceiling of 120MB/s 
because I really don't understand?


-- Alan





Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 5:41 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

That's interesting, as I occasionally see more than 110MB/s of
postgresql IO on our system.  I'm using a 32KB block size, which has
been a huge win in performance for our usage patterns.   300GB database
with a lot of turnover.  A vacuum analyze now takes about 3 hours, which
is much shorter than before.  Postgresql 8.1, dual opteron, 8GB memory,
Linux 2.6.11, FC drives.



300GB / 3 hours = 27MB/s.
  
That's 3 hours under load, with 80 compute clients beating on the 
database at the same time.   We have the stats turned way up, so the 
analyze tends to read a big chunk of the tables a second time as 
well.We typically don't have three hours a day of idle time.


-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Richard Huxton wrote:

Dave Cramer wrote:


On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote:

Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after
110MB/s of I/O.  This is true of Postgres 7.4, 8.0 and 8.1.

A $1,000 system with one CPU and two SATA disks in a software RAID0 will
perform exactly the same as a $80,000 system with 8 dual core CPUs and the
world's best SCSI RAID hardware on a large database for decision support
(what the poster asked about).



Now there's an interesting line drawn in the sand. I presume you 
have  numbers to back this up ?


This should draw some interesting posts.


That's interesting, as I occasionally see more than 110MB/s of 
postgresql IO on our system.  I'm using a 32KB block size, which has 
been a huge win in performance for our usage patterns.   300GB database 
with a lot of turnover.  A vacuum analyze now takes about 3 hours, which 
is much shorter than before.  Postgresql 8.1, dual opteron, 8GB memory, 
Linux 2.6.11, FC drives.


-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 8:13 AM, Alan Stange [EMAIL PROTECTED] wrote:

I told you in my initial post that I was observing numbers in
excess of
what you claiming, but you seemed to think I didn't know how to
measure
an IO rate.

Prove me wrong, post your data.

I should note too that our system uses about 20% of a single cpu when
performing a table scan at 100MB/s of IO. I think you claimed the
system would be cpu bound at this low IO rate.


See above.
Here's the output from one iteration of iostat -k 60 while the box is 
doing a select count(1) on a 238GB table.


avg-cpu:  %user   %nice%sys %iowait   %idle
  0.990.00   17.97   32.40   48.64

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdd 345.95130732.53 0.007843952  0

We're reading 130MB/s for a full minute.  About 20% of a single cpu was 
being used.   The remainder being idle.


We've done nothing fancy and achieved results you claim shouldn't be 
possible.  This is a system that was re-installed yesterday, no tuning 
was done to the file systems, kernel or storage array.


What am I doing wrong?

9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO 
(for a DOE lab).   And now I don't know what I'm doing,


Cheers,

-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

Alan,

On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote:

  

Here's the output from one iteration of iostat -k 60 while the box is
doing a select count(1) on a 238GB table.

avg-cpu:  %user   %nice%sys %iowait   %idle
   0.990.00   17.97   32.40   48.64

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdd 345.95130732.53 0.007843952  0

We're reading 130MB/s for a full minute.  About 20% of a single cpu was
being used.   The remainder being idle.



Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
system otherwise idle?
  
Actually, this was dual cpu and there was other activity during the full 
minute, but it was on other file devices, which I didn't include in the 
above output.   Given that, and given what I see on the box now I'd 
raise the 20% to 30% just to be more conservative.  It's all in the 
kernel either way; using a different scheduler or file system would 
change that result.  Even better would be using direct IO to not flush 
everything else from memory and avoid some memory copies from kernel to 
user space.  Note that almost none of the time is user time.  Changing 
postgresql won't change the cpu usage.


One IMHO obvious improvement would be to have vacuum and analyze only do 
direct IO.  Now they appear to be very effective memory flushing tools.  
Table scans on tables larger than say 4x memory should probably also use 
direct IO for reads.


 
  

We've done nothing fancy and achieved results you claim shouldn't be
possible.  This is a system that was re-installed yesterday, no tuning
was done to the file systems, kernel or storage array.



Are you happy with 130MB/s?  How much did you pay for that?  Is it more than
$2,000, or double my 2003 PC?
  
I don't know what the system cost.   It was part of block of dual 
opterons from Sun that we got some time ago.   I think the 130MB/s is 
slow given the hardware, but it's acceptable.  I'm not too price 
sensitive; I care much more about reliability, uptime, etc.  

 
  

What am I doing wrong?

9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO
(for a DOE lab).   And now I don't know what I'm doing,


Cool.  Would that be Sandia?

We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on
complex queries.
Disk?!  4 StorageTek tape silos.  That would be .002 TB/s.  One has to 
change how you think when you have that much data. And hope you don't 
have a fire, because there's no backup.   That work was while I was at 
BNL.   I believe they are now at 4PB of tape and 150TB of disk.


-- Alan



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Luke Lonergan wrote:

opterons from Sun that we got some time ago.   I think the 130MB/s is
slow given the hardware, but it's acceptable.  I'm not too price
sensitive; I care much more about reliability, uptime, etc.


I don't know what the system cost. It was part of block of dual

Then I know what they cost - we have them too (V20z and V40z).  You should
be getting 400MB/s+ with external RAID.
Yes, but we don't.   This is where I would normally begin a rant on how 
craptacular Linux can be at times.  But, for the sake of this 
discussion, postgresql isn't reading the data any more slowly than does 
any other program.


And we don't have the time to experiment with the box.

I know it should be better, but it's good enough for our purposes at 
this time.


-- Alan




Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-18 Thread Alan Stange

Greg Stark wrote:

Alan Stange [EMAIL PROTECTED] writes:

  

Luke Lonergan wrote:


Alan,

On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote:


  

Here's the output from one iteration of iostat -k 60 while the box is
doing a select count(1) on a 238GB table.

avg-cpu:  %user   %nice%sys %iowait   %idle
   0.990.00   17.97   32.40   48.64

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdd 345.95130732.53 0.007843952  0

We're reading 130MB/s for a full minute.  About 20% of a single cpu was
being used.   The remainder being idle.



Cool - thanks for the results.  Is that % of one CPU, or of 2?  Was the
system otherwise idle?

  
Actually, this was dual cpu 



I hate to agree with him but that looks like a dual machine with one CPU
pegged. Yes most of the time is being spent in the kernel, but you're still
basically cpu limited.

That said, 130MB/s is nothing to sneeze at, that's maxing out two high end
drives and quite respectable for a 3-disk stripe set, even reasonable for a
4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and
only getting 130MB/s then it does seem likely the cpu is actually holding you
back here.

Still it doesn't show Postgres being nearly so CPU wasteful as the original
poster claimed.
  
Yes and no.   The one cpu is clearly idle.   The second cpu is 40% busy 
and 60% idle (aka iowait in the above numbers).
Of that 40%, other things were happening as well during the 1 minute 
snapshot.   During some iostat outputs that I didn't post the cpu time 
was ~ 20%.


So, you can take your pick.   The single cpu usage is somewhere between 
20% and 40%.  As I can't remove other users of the system, it's the best 
measurement that I can make right now.


Either way, it's not close to being cpu bound.  This is with Opteron 
248, 2.2GHz cpus.


Note that the storage system has been a bit disappointing:  it's an IBM 
Fast T600 with a 200MB/s fiber attachment.   It could be better, but 
it's not been the bottleneck in our work, so we haven't put any energy 
into it.  


It's all in the kernel either way; using a different scheduler or file
system would change that result. Even better would be using direct IO to not
flush everything else from memory and avoid some memory copies from kernel
to user space. Note that almost none of the time is user time. Changing
postgresql won't change the cpu usage.


Well changing to direct i/o would still be changing Postgres so that's
unclear. And there are plenty of more mundane ways that Postgres is
responsible for how efficiently or not the kernel is used. Just using fewer
syscalls to do the same amount of reading would reduce cpu consumption.
Absolutely.  This is why we're using a 32KB block size and also switched 
to using O_SYNC for the WAL syncing method. That's many MB/s that 
don't need to be cached in the kernel (thus evicting other data), and we 
avoid all the fsync/fdatasync syscalls.


The purpose of direct IO isn't to make the vacuum or analyze faster, but 
to lessen their impact on queries with someone waiting for the 
results.   That's our biggest hit:  running a sequential scan on 240GB 
of data and flushing everything else out of memory.


Now that I'm think about this a bit, a big chunk of time is probably 
being lost in TLB misses and other virtual memory events that would be 
avoided if a larger page size was being used.


-- Alan 





Re: [PERFORM] Is There Any Way ....

2005-10-24 Thread Alan Stange

Alex Turner wrote:

This is possible with Oracle utilizing the keep pool

alter table t_name storage ( buffer_pool keep);

If Postgres were to implement its own caching system, this seems like
it would be easy to implement (beyond the initial caching effort).

Alex


On 10/24/05, Craig A. James [EMAIL PROTECTED] wrote:
  

Jim C. Nasby jnasby ( at ) pervasive ( dot ) com wrote:


Stefan Weiss wrote:
... IMO it would be useful to have a way to tell
PG that some tables were needed frequently, and should be cached if
possible. This would allow application developers to consider joins with
these tables as cheap, even when querying on columns that are
not indexed.


Why do you think you'll know better than the database how frequently
something is used? At best, your guess will be correct and PostgreSQL
(or the kernel) will keep the table in memory. Or, your guess is wrong
and you end up wasting memory that could have been used for something
else.

It would probably be better if you describe why you want to force this
table (or tables) into memory, so we can point you at more appropriate
solutions.
  

Or perhaps we could explain why we NEED to force these tables into memory, so 
we can point you at a more appropriate implementation.  ;-)

Ok, wittiness aside, here's a concrete example.  I have an application with one 
critical index that MUST remain in memory at all times.  The index's tablespace 
is about 2 GB.  As long as it's in memory, performance is excellent - a user's 
query takes a fraction of a second.  But if it gets swapped out, the user's 
query might take up to five minutes as the index is re-read from disk.

Now here's the rub.  The only performance I care about is response to queries 
from the web application.  Everything else is low priority.  But there is other 
activity going on.  Suppose, for example, that I'm updating tables, performing 
queries, doing administration, etc., etc., for a period of an hour, during 
which no customer visits the site.  The another customer comes along and 
performs a query.

At this point, no heuristic in the world could have guessed that I DON'T CARE 
ABOUT PERFORMANCE for anything except my web application.  The performance of 
all the other stuff, the administration, the updates, etc., is utterly 
irrelevant compared to the performance of the customer's query.

What actually happens is that the other activities have swapped out the critical index, and my 
customer waits, and waits, and waits... and goes away after a minute or two.  To solve this, we've 
been forced to purchase two computers, and mirror the database on both.  All administration and 
modification happens on the offline database, and the web application only uses the 
online database.  At some point, we swap the two servers, sync the two databases, and 
carry on.  It's a very unsatisfactory solution.
We have a similar problem with vacuum being the equivalent of 
continuously flushing all system caches for a long time.  Our database is 
about 200GB in size and vacuums take hours and hours.   The performance 
is acceptable still, but only because we've hidden the latency in our 
application.


I've occasionally thought it would be good to have the backend doing a 
vacuum or analyze also call priocntl() prior to doing any real work to 
lower its priority.   We'll be switching to the 8.1 release ASAP just 
because the direct IO capabilities are appearing to be a win on our 
development system.


-- Alan




Re: [PERFORM] Performance considerations for very heavy INSERT traffic

2005-09-12 Thread Alan Stange

Brandon Black wrote:




On 9/12/05, *PFC* [EMAIL PROTECTED] 
mailto:[EMAIL PROTECTED] wrote:




- benchmarking something else than ext3
(xfs ? reiser3 ?)


We've had bad experiences under extreme and/or strange workloads with 
XFS here in general, although this is the first major postgresql 
project - the rest were with other applications writing to XFS.  Bad 
experiences like XFS filesystems detecting internal inconsistencies 
at runtime and unmounting themselves from within the kernel module 
(much to the dismay of applications with open files on the 
filesystem), on machines with validated good hardware.  It has made me 
leary of using anything other than ext3 for fear of stability 
problems.  Reiser3 might be worth taking a look at though.


Just one tidbit.   We tried XFS on a very active system similar to what 
you describe.   Dual opterons, 8GB memory, fiber channel drives, 2.6 
kernel, etc.   And the reliability was awful.   We spent a lot of time 
making changes one at a time to try and isolate the cause; when we 
switched out from XFS to ReiserFS our stability problems went away.


It may be the case that the XFS problems have all been corrected in 
newer kernels, but I'm not going to put too much effort into trying that 
again.



I recently built a postgres with 32KB block sizes and have been doing 
some testing.  For our particular workloads it has been a win.


-- Alan



Re: [PERFORM] difference in plan between 8.0 and 8.1?

2005-08-26 Thread Alan Stange

Tom Lane wrote:

Alan Stange [EMAIL PROTECTED] writes:
  

 Unique  (cost=2717137.08..2771407.21 rows=10854026 width=8)
   ->  Sort  (cost=2717137.08..2744272.14 rows=10854026 width=8)
         Sort Key: timeseriesid
         ->  Bitmap Heap Scan on tbltimeseries  (cost=48714.09..1331000.42 rows=10854026 width=8)
               Recheck Cond: (timeseriesid > 0)
               ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48714.09 rows=10854026 width=0)
                     Index Cond: (timeseriesid > 0)
(7 rows)



  
I'm hoping someone can explain the new query plan (as I'm not sure I 
understand what it is doing).



The index scan is reading the index to find out which heap tuple IDs
(TIDs) the index says meet the condition.  It returns a bitmap of the
tuple locations (actually, an array of per-page bitmaps).  The heap
scan goes and fetches the tuples from the table, working in TID order
to avoid re-reading the same page many times, as can happen for ordinary
index scans.  Since the result isn't sorted, we have to do a sort to get
it into the correct order for the Unique step.

Because it avoids random access to the heap, this plan can be a lot
faster than a regular index scan.  I'm not sure at all that 8.1 is
doing good relative cost estimation yet, though.  It would be
interesting to see EXPLAIN ANALYZE results for both ways.  (You can
use enable_bitmapscan and enable_indexscan to force the planner to pick
the plan it thinks is slower.)
Just to be clear:  the index is on the timeseriesid column.   Also, we 
usually have the where clause with some non-zero number.


Anyway, here's the basic query, with variations added on belowe:

fiasco=# explain analyze select timeseriesid from tbltimeseries where timeseriesid > 0;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on tbltimeseries  (cost=48906.82..1332935.19 rows=10905949 width=8) (actual time=16476.337..787480.979 rows=10907853 loops=1)
   Recheck Cond: (timeseriesid > 0)
   ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48906.82 rows=10905949 width=0) (actual time=16443.585..16443.585 rows=10907853 loops=1)
         Index Cond: (timeseriesid > 0)
 Total runtime: 791340.341 ms
(5 rows)



Now add the order:

fiasco=# explain analyze select timeseriesid from tbltimeseries where timeseriesid > 0 order by timeseriesid;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=2726087.93..2753352.81 rows=10905949 width=8) (actual time=821090.666..826353.054 rows=10913868 loops=1)
   Sort Key: timeseriesid
   ->  Bitmap Heap Scan on tbltimeseries  (cost=48912.82..1332941.19 rows=10905949 width=8) (actual time=16353.921..757075.349 rows=10913868 loops=1)
         Recheck Cond: (timeseriesid > 0)
         ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48912.82 rows=10905949 width=0) (actual time=16335.239..16335.239 rows=10913868 loops=1)
               Index Cond: (timeseriesid > 0)
 Total runtime: 830829.145 ms
(7 rows)




and the distinct:

fiasco=# explain analyze select distinct timeseriesid from tbltimeseries where timeseriesid > 0 order by timeseriesid;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=2726087.93..2780617.68 rows=10905949 width=8) (actual time=816938.970..831119.423 rows=10913868 loops=1)
   ->  Sort  (cost=2726087.93..2753352.81 rows=10905949 width=8) (actual time=816938.967..822298.802 rows=10913868 loops=1)
         Sort Key: timeseriesid
         ->  Bitmap Heap Scan on tbltimeseries  (cost=48912.82..1332941.19 rows=10905949 width=8) (actual time=15866.736..752851.006 rows=10913868 loops=1)
               Recheck Cond: (timeseriesid > 0)
               ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48912.82 rows=10905949 width=0) (actual time=15852.652..15852.652 rows=10913868 loops=1)
                     Index Cond: (timeseriesid > 0)
 Total runtime: 835558.312 ms
(8 rows)




Now the usual query from 8.0:

fiasco=# set enable_bitmapscan=false; explain analyze select distinct timeseriesid from tbltimeseries where timeseriesid > 0 order by timeseriesid;
SET
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=0.00..14971276.10 rows=10905949 width=8) (actual time=24.930..999645.638 rows=10913868 loops=1)
   ->  Index Scan using

Re: [PERFORM] Caching by Postgres

2005-08-24 Thread Alan Stange

[EMAIL PROTECTED] wrote:

So then we move on to what 64-bit is really useful for. Obviously,
there is the arithmetic. If you were previously doing 64-bit
arithmetic through software, you will notice an immediate speed
improvement when doing it through hardware instead. If you have
a program that is scanning memory in any way, it may benefit from
64-bit instructions (for example - copying data 64-bit words at
a time instead of 32-bit words at a time). PostgreSQL might benefit
slightly from either of these, slightly balancing the performance
degradation of using more memory to store the pointers, and more
memory bandwidth the access the pointers.
  
At least on Sparc processors, v8 and newer, any double precision math 
(including longs) is performed with a single instruction, just like for 
a 32 bit datum.  Loads and stores of 8 byte datums are also handled via 
a single instruction.   The urban myth that 64bit math is 
different/better on a 64 bit processor is just that;  yes, some lower 
end processors would emulate/trap those instructions but that's an 
implementation detail, not architecture.  I believe that this is all 
true for other RISC processors as well.


The 64bit API on UltraSparcs does bring along some extra FP registers IIRC.


If, however, you happen to have a very large amount of physical memory
- more memory than is supported by a 32-bit system, but is supported
by your 64-bit system, then the operating system should be able to use
this additional physical memory to cache file system data pages, which
will benefit PostgreSQL if used with tables that are larger than the
memory supported by your 32-bit system, and which have queries which
require more pages than the memory supported by your 32-bit system to
be frequently accessed. If you have a huge database, with many clients
accessing the data, this would be a definite yes. With anything less,
it is a maybe, or a probably not.
  
Solaris, at least, provided support for far more than 4GB of physical 
memory on 32 bit kernels.  A newer 64 bit kernel might be more 
efficient, but that's just because the time was taken to support large 
page sizes and more efficient data structures.  It's nothing intrinsic 
to a 32 vs 64 bit kernel.


-- Alan



Re: [PERFORM] Read/Write block sizes

2005-08-23 Thread Alan Stange

Josh Berkus wrote:


Steve,

 


I would assume that dbt2 with STP helps minimize the amount of hours
someone has to invest to determine performance gains with configurable
options?
   



Actually, these I/O operation issues show up mainly with DW workloads, so the 
STP isn't much use there.   If I can ever get some of these machines back 
from the build people, I'd like to start testing some stuff.


One issue with testing this is that currently PostgreSQL doesn't support block 
sizes above 128K.  We've already done testing on that (well, Mark has) and 
the performance gains aren't even worth the hassle of remembering you're on a 
different block size (like, +4%).
 


What size database was this on?

What the Sun people have done with other DB systems is show that substantial 
performance gains are possible on large databases (100G) using block sizes 
of 1MB.   I believe that's possible (and that it probably makes more of a 
difference on Solaris than on BSD) but we can't test it without some hackery 
first.


We're running on a 100+GB database, with long streams of 8KB reads with 
the occasional _llseek().  I've been thinking about running with a 
larger blocksize with the expectation that we'd see fewer system calls 
and a bit more throughput.


read() calls are a very expensive way to get 8KB of memory (that we know 
is already resident) during scans.  One has to trap into the kernel, do 
the usual process state accounting, find the block, copy the memory to 
userspace, return back from the kernel to user space reversing all the 
process accounting, pick out the bytes one needs, and repeat all over 
again.That's quite a few sacrificial cache lines for 8KB.   Yeah, 
sure, Linux syscalls are fast, but they aren't that fast, and other 
operating systems (windows and solaris) have a bit more overhead on 
syscalls.


Regarding large blocks sizes on Solaris:  the Solaris folks can also use 
large memory pages and avoid a lot of the TLB overhead  from the VM 
system.  The various trapstat and cpustat commands can be quite 
interesting to look at when running any large application on a Solaris 
system. 

It should be noted that having a large shared memory segment can be a 
performance loser just from the standpoint of TLB thrashing.  O(GB) 
memory access patterns can take a huge performance hit in user space 
with 4K pages compared to the kernel which would be mapping the segmap 
(in Solaris parlance) with 4MB pages.


Anyway, I guess my point is that the balance between kernel managed vs. 
postgresql managed buffer isn't obvious at all.


-- Alan



[PERFORM] BG writer question?

2005-08-11 Thread Alan Stange

Hello all,

I just was running strace in the writer process and I noticed this pattern:

select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
getppid()   = 4240
time(NULL)  = 1123773324
mmap2(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0x81000) = 0x69ea3000

semop(1409034, 0xc0bc, 1)   = 0
...seeks and writes...
munmap(0x69ea3000, 528384)  = 0
select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
getppid()   = 4240
time(NULL)  = 1123773324
mmap2(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 
0x81000) = 0x69ea3000

semop(1605648, 0xc0bc, 1)   = 0
...seeks and writes...
munmap(0x69ea3000, 528384)  = 0
select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)


why mmap and munmap each time?  mmap and munmap are fairly expensive 
operations (on some systems), especially on multi cpu machines.  munmap 
in particular generally needs to issue cross calls to the other cpus to 
ensure any page mappings are invalidated. 


Just curious.

Thanks!

-- Alan



Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Alan Stange
Alex Turner wrote:
[snip]
 

Adding drives will not let you get lower response times than the average seek
time on your drives*. But it will let you reach that response time more often.
   

[snip]
I believe your assertion is fundamentaly flawed.  Adding more drives
will not let you reach that response time more often.  All drives are
required to fill every request in all RAID levels (except possibly
0+1, but that isn't used for enterprise applications).  Most requests
in OLTP require most of the request time to seek, not to read.  Only
in single large block data transfers will you get any benefit from
adding more drives, which is atypical in most database applications. 
For most database applications, the only way to increase
transactions/sec is to decrease request service time, which is
generally achieved with better seek times or a better controller card,
or possibly spreading your database across multiple tablespaces on
separate partitions.

My assertion therefore is that simply adding more drives to an already
competent* configuration is about as likely to increase your database
effectiveness as swiss cheese is to make your car run faster.
 

Consider the case of a mirrored file system with a mostly read() 
workload.  Typical behavior is to use a round-robin method for issuing 
the read operations to each mirror in turn, but one can use other 
methods like a geometric algorithm that will issue the reads to the 
drive with the head located closest to the desired track.Some 
systems have many mirrors of the data for exactly this behavior.   In 
fact, one can carry this logic to the extreme and have one drive for 
every cylinder in the mirror, thus removing seek latencies completely.  
In fact this extreme case would also remove the rotational latency as 
the cylinder will be in the disk's read cache.  :-)   Of course, writing 
data would be a bit slow!

I'm not sure I understand your assertion that all drives are required 
to fill every request in all RAID levels.   After all, in mirrored 
reads only one mirror needs to read any given block of data, so I don't 
know what goal is achieved in making other mirrors read the same data.

My assertion (based on ample personal experience) is that one can 
*always* get improved performance by adding more drives.  Just limit the 
drives to use the first few cylinders so that the average seek time is 
greatly reduced and concatenate the drives together.  One can then build 
the usual RAID device out of these concatenated metadevices.  Yes, one 
is wasting lots of disk space, but that's life.   If your goal is 
performance, then you need to put your money on the table. The 
system will be somewhat unreliable because of the device count, 
additional SCSI buses, etc., but that too is life in the high 
performance world.

-- Alan


Re: [PERFORM] How to improve db performance with $7K?

2005-04-15 Thread Alan Stange
PFC wrote:

My argument is that a sufficiently smart kernel scheduler *should*
yield performance results that are reasonably close to what you can
get with that feature.  Perhaps not quite as good, but reasonably
close.  It shouldn't be an orders-of-magnitude type difference.

And a controller card (or drive) has a lot less RAM to use as a 
cache /  queue for reordering stuff than the OS has, potentially the 
OS can us most  of the available RAM, which can be gigabytes on a big 
server, whereas in  the drive there are at most a few tens of 
megabytes...

However all this is a bit looking at the problem through the wrong 
end.  The OS should provide a multi-read call for the applications to 
pass a  list of blocks they'll need, then reorder them and read them 
the fastest  possible way, clustering them with similar requests from 
other threads.

Right now when a thread/process issues a read() it will block 
until the  block is delivered to this thread. The OS does not know if 
this thread  will then need the next block (which can be had very 
cheaply if you know  ahead of time you'll need it) or not. Thus it 
must make guesses, read  ahead (sometimes), etc...
All true.  Which is why high performance computing folks use 
aio_read()/aio_write() and load up the kernel with all the requests they 
expect to make. 

The kernels that I'm familiar with will do read ahead on files based on 
some heuristics:  when you read the first byte of a file the OS will 
typically load up several pages of the file (depending on file size, 
etc).  If you continue doing read() calls without a seek() on the file 
descriptor the kernel will get the hint that you're doing a sequential 
read and continue caching up the pages ahead of time, usually using the 
pages you just read to hold the new data so that one isn't bloating out 
memory with data that won't be needed again.  Throw in a seek() and the 
amount of read ahead caching may be reduced.

One point that is being missed in all this discussion is that the file 
system also imposes some constraints on how IO's can be done.  For 
example, simply doing a write(fd, buf, 1) doesn't emit a stream 
of sequential blocks to the drives.  Some file systems (UFS was one) 
would force portions of large files into other cylinder groups so that 
small files could be located near the inode data, thus avoiding/reducing 
the size of seeks.  Similarly, extents need to be allocated and the 
bitmaps recording this data usually need synchronous updates, which will 
require some seeks, etc.  Not to mention the need to update inode data, 
etc.  Anyway, my point is that the allocation policies of the file 
system can confuse the situation.

Also, the seek times one sees reported are an average.  One really needs 
to look at the track-to-track seek time and also the full stroke seek 
times.   It takes a *long* time to move the heads across the whole 
platter.  I've seen people partition drives to only use small regions of 
the drives to avoid long seeks and to better use the increased number of 
bits going under the head in one rotation.   A 15K drive doesn't 
necessarily have a faster seek time than a 10K drive just because its 
rotational speed is higher.  The average seek time might be lower simply 
because the 15K drives are smaller, with fewer cylinders. 

-- Alan


Re: [PERFORM] PostgreSQL on Solaris 8 and ufs

2005-03-22 Thread Alan Stange
Brandon Metcalf wrote:
We've recently moved our pgsql installation and DBs to a Solaris 8
machine with striped and mirrored ufs filesystem that houses the DB
data.  We are now seeing terrible performance and the bottleneck is no
doubt disk I/O.
We've tried modifying a tunables related to ufs, but it doesn't seem
to be helping.
Is there anything we should be looking at that is specifically related
to ufs filesystems on Solaris 8 or possibly something in general that
would improve performance?
 

Well, Solaris 8 is a bit old now, so I don't remember all the details.   
But, if memory serves, Solaris 8 still has some high-water and 
low-water tunables related to the amount of IO that can be outstanding 
to a single file.

Try setting
set ufs:ufs_WRITES=0
in /etc/system and rebooting, which basically says any amount of disk 
IO can be outstanding.  There's a tunables doc on docs.sun.com that 
explains this option.

Also, logging UFS might help with some of the metadata requirements of 
UFS as well.  So, use mount -o logging or add the relevant entry in 
/etc/vfstab.

Of course, the best thing is Solaris 9 or 10, which would be much better 
for this sort of thing.

Hope this helps.
-- Alan


Re: [PERFORM] multi billion row tables: possible or insane?

2005-03-01 Thread Alan Stange
Ramon Bastiaans wrote:
I am doing research for a project of mine where I need to store 
several billion values for a monitoring and historical tracking system 
for a big computer system. My currect estimate is that I have to store 
(somehow) around 1 billion values each month (possibly more).

I was wondering if anyone has had any experience with these kind of 
big numbers of data in a postgres sql database and how this affects 
database design and optimization.

What would be important issues when setting up a database this big, 
and is it at all doable? Or would it be insane to think about 
storing up to 5-10 billion rows in a postgres database.

The database's performance is important. There would be no use in 
storing the data if a query will take ages. Query's should be quite 
fast if possible.

I would really like to hear people's thoughts/suggestions or go see a 
shrink, you must be mad statements ;)
It just dawned on me that we're doing something that, while not the 
same, might be relevant.  One of our tables has ~85M rows in it 
according to the output from an explain select * from table.  I don't 
plan on trying a select count(*) any time soon. :)   We add and remove 
about 25M rows a day to/from this table which would be about 750M 
rows/month total.  Given our current usage of the database, it could 
handle a larger row/day rate without too much trouble.  (The problem 
isn't adding rows but deleting rows.)

    Column    |   Type   | Modifiers
--------------+----------+-----------
 timeseriesid | bigint   |
 bindata      | bytea    |
 binsize      | integer  |
 rateid       | smallint |
 ownerid      | smallint |
Indexes:
    idx_timeseries btree (timeseriesid)
In this case, each bytea entry is typically about  2KB of data, so the 
total table size is about 150GB, plus some index overhead.

A second table has ~100M rows according to explain select *.  Again it 
has about 30M rows added and removed / day.  

   Column   |         Type          | Modifiers
------------+-----------------------+-----------
 uniqid     | bigint                |
 type       | character varying(50) |
 memberid   | bigint                |
 tag        | character varying(50) |
 membertype | character varying(50) |
 ownerid    | smallint              |
Indexes:
    composite_memberid btree (memberid)
    composite_uniqid btree (uniqid)
There are some additional tables that have a few million rows / day of 
activity, so call it 60M rows/day added and removed.  We run a vacuum 
every day.

The box is a dual Opteron 248 from Sun.   Linux 2.6, 8GB of memory.  We 
use reiserfs.  We started with XFS but had several instances of file 
system corruption.  Obviously, no RAID 5.  The xlog is on a 2 drive 
mirror and the rest is on separate mirrored volume.  The drives are 
fiber channel but that was a mistake as the driver from IBM wasn't very 
good.

So, while we don't have a billion rows we do have ~200M total rows in 
all the tables and we're certainly running the daily row count that 
you'd need to obtain.   But scaling this sort of thing up can be tricky 
and your mileage may vary.

In a prior career I ran a data intensive computing center and helped 
do some design work for a high energy physics experiment:  petabytes of 
data, big tape robots, etc., the usual Big Science toys.   You might 
take a look at ROOT and some of the activity from those folks if you 
don't need transactions and all the features of a general database like 
postgresql.

-- Alan


Re: [PERFORM] Swapping on Solaris

2005-01-19 Thread Alan Stange
Mark Kirkwood wrote:
Kevin Schroeder wrote:

Ignoring the fact that the sort and vacuum numbers are really high, 
this is what Solaris shows me when running top:

Memory: 2048M real, 1376M free, 491M swap in use, 2955M swap free
Maybe check the swap usage with 'swap -l' which reports reliably if any
(device or file) swap is actually used.
I think Solaris 'top' does some strange accounting to calculate the
'swap in use' value (like including used memory).
It looks to me like you are using no (device or file) swap at all, and
have 1.3G of real memory free, so could in fact give Postgres more of 
it :-)
I suspect that free memory is in fact being used for the file system 
cache.   There were some changes in the meaning of "free" in Solaris 8 
and 9.   The memstat command gives a nice picture of memory usage on the 
system.   I don't think memstat came with Solaris 8, but you can get it 
from solarisinternals.com.   The Solaris Internals book is an excellent 
read as well; it explains all of this in gory detail. 

Note that files in /tmp are usually in a tmpfs file system.   These 
files may be the usage of swap that you're seeing (as they will be paged 
out on an active system with some memory pressure)

Finally, just as everyone suggests upgrading to newer postgresql 
releases, you probably want to get to a newer Solaris release. 

-- Alan


Re: [PERFORM] Swapping on Solaris

2005-01-19 Thread Alan Stange
Kevin Schroeder wrote:
I suspect that the memory is being used to cache files as well since 
the email boxes are using unix mailboxes, for the time being.  With 
people checking their email sometimes once per minute I can see why 
Solaris would want to cache those files.  Perhaps my question would be 
more appropriate to a Solaris mailing list since what I really want to 
do is get Solaris to simply allow PostgreSQL to use more RAM and 
reduce the amount of RAM used for file caching.  I would have thought 
that Solaris gives some deference to a running application that's 
being swapped than for a file cache.

Is there any way to set custom parameters on Solaris' file-caching 
behavior to allow PostgreSQL to use more physical RAM?
Your explanation doesn't sound quite correct.   If postgresql malloc()'s 
some memory and uses it, the file cache will be reduced in size and the 
memory given to postgresql.   But if postgresql doesn't ask for or use 
the memory, then solaris is going to use it for something else.  There's 
nothing in Solaris that doesn't allow postgresql to use more RAM.

-- Alan


Re: [PERFORM] Swapping on Solaris

2005-01-19 Thread Alan Stange
Kevin Schroeder wrote:
I take that back.  There actually is some paging going on.  I ran sar 
-g 5 10 and when a request was made (totally about 10 DB queries) my 
pgout/s jumped to 5.8 and my ppgout/s jumped to 121.8.  pgfree/s also 
jumped to 121.80.
I'm fairly sure that the pi and po numbers include file IO in Solaris, 
because of the unified VM and file systems.

-- Alan


Re: [PERFORM] Swapping on Solaris

2005-01-19 Thread Alan Stange
Kevin Schroeder wrote:
I may be asking the question the wrong way, but when I start up 
PostgreSQL swap is what gets used the most of.  I've got 1282MB free 
RAM right now and and 515MB swap in use.  Granted, swap file usage 
probably wouldn't be zero, but I would guess that it should be a lot 
lower so something must be keeping PostgreSQL from using the free RAM 
that my system is reporting.  For example, one of my postgres 
processes is 201M in size but only 72M is resident in RAM.  That extra 
130M is available in RAM, according to top, but postgres isn't using it. 
The test you're doing doesn't measure what you think you're measuring.
First, what else is running on the machine?   Note that some shared 
memory allocations do reserve backing pages in swap, even though the 
pages aren't currently in use.  Perhaps this is what you're measuring?  
swap -s has better numbers than top.

You'd be better by trying a reboot then starting pgsql and seeing what 
memory is used.

Just because you start a process and see the swap number increase 
doesn't mean that the new process is in swap.  It means some anonymous 
pages had to be evicted to swap to make room for the new process or some 
pages had to be reserved in swap for future use.   Typically a new 
process won't be paged out unless something else is causing enormous 
memory pressure...

-- Alan


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-15 Thread Alan Stange
Tom Lane wrote:
Kevin Brown [EMAIL PROTECTED] writes:
 

Hmm...something just occurred to me about this.
   

Would a hybrid approach be possible?  That is, use mmap() to handle
reads, and use write() to handle writes?
   

Nope.  Have you read the specs regarding mmap-vs-stdio synchronization?
Basically it says that there are no guarantees whatsoever if you try
this.  The SUS text is a bit weaselly (the application must ensure
correct synchronization) but the HPUX mmap man page, among others,
lays it on the line:
It is also unspecified whether write references to a memory region
mapped with MAP_SHARED are visible to processes reading the file and
whether writes to a file are visible to processes that have mapped the
modified portion of that file, except for the effect of msync().
It might work on particular OSes but I think depending on such behavior
would be folly...
We have some anecdotal experience along these lines:   There was a set 
of kernel bugs in Solaris 2.6 or 7 related to this as well.   We had 
several kernel panics and it took a bit to chase down, but the basic 
feedback was "oops, we're screwed."   I've forgotten most of the 
details right now; the basic problem was a file being read+written 
via mmap and read()/write() at (essentially) the same time from the same 
pid.   It would panic the system quite reliably.  I believe the bugs 
related to this have been resolved in Solaris, but it was unpleasant to 
chase that problem down...

-- Alan


Re: [PERFORM] Excessive context switching on SMP Xeons

2004-10-07 Thread Alan Stange
Bill Montgomery wrote:
Alan Stange wrote:
Here's a few numbers from the Opteron 250.  If I get some time I'll 
post a more comprehensive comparison including some other systems.

The system is a Sun v20z.  Dual Opteron 250, 2.4Ghz, Linux 2.6, 8 GB 
memory.   I did a compile and install of pg 8.0 beta 3.  I created a 
data base on a tmpfs file system and ran pgbench.  Everything was 
out of the box, meaning I did not tweak any config files.

I used this for pgbench:
$ pgbench -i -s 32
and this for pgbench invocations:
$ pgbench -s 32 -c 1 -t 1 -v
clients   tps
1        1290
2        1780
4        1760
8        1680
16       1376
32        904

The same test on a Dell PowerEdge 1750, Dual Xeon 3.2 GHz, 512k cache, 
HT on, Linux 2.4.21-20.ELsmp (RHEL 3), 4GB memory, pg 7.4.5:

$ pgbench -i -s 32 pgbench
$ pgbench -s 32 -c 1 -t 1 -v
clients   tps   avg CS/sec
---  -  --
 1601  48,000
 2889  77,000
 4   1006  80,000
 8985  59,000
16966  47,000
32913  46,000
Far less performance that the Dual Opterons with a low number of 
clients, but the gap narrows as the number of clients goes up. Anyone 
smarter than me care to explain?
boy, did Thunderbird ever botch the format of the table I entered...
I thought the falloff at 32 clients was a bit steep as well.   One 
thought that crossed my mind is that pgbench -s 32 -c 32 ... might not 
be valid.   From the pgbench README:

   -s scaling_factor
   this should be used with -i (initialize) option.
   number of tuples generated will be multiple of the
   scaling factor. For example, -s 100 will imply 10M
   (10,000,000) tuples in the accounts table.
   default is 1.  NOTE: scaling factor should be at least
   as large as the largest number of clients you intend
   to test; else you'll mostly be measuring update contention.
Another possible cause is the that pgbench process is cpu starved and 
isn't able to keep driving the postgresql processes.   So I ran pgbench 
from another system with all else the same.The numbers were a bit 
smaller but otherwise similar.

I then reran everything using -s 64:
clients   tps
1        1254
2        1645
4        1713
8        1548
16       1396
32       1060
Still starting to head down a bit.  In the 32 client case, the system 
was ~60% user time, ~25% system and ~15% idle. Anyway, the machine is 
clearly hitting some contention somewhere.   It could be in the tmpfs 
code, VM system, etc.

-- Alan




Re: [PERFORM] Excessive context switching on SMP Xeons

2004-10-06 Thread Alan Stange
Greg Stark wrote:
Alan Stange [EMAIL PROTECTED] writes:
 

A few quick random observations on the Xeon v. Opteron comparison:
- running a dual Xeon with hyperthreading turned on really isn't the same as
having a quad cpu system. I haven't seen postgresql specific benchmarks, but
the general case has been that HT is a benefit in a few particular work
loads but with no benefit in general.
   

Part of the FUD with hyperthreading did have a kernel of truth that lied in
older kernels' schedulers. For example with Linux until recently the kernel
can easily end up scheduling two processes on the two virtual processors of
one single physical processor, leaving the other physical processor totally
idle.
With modern kernels' schedulers I would expect hyperthreading to live up to
its billing of adding 10% to 20% performance. Ie., a dual Xeon machine with
hyperthreading won't be as fast as four processors, but it should be 10-20%
faster than a dual Xeon without hyperthreading.
As with all things that will only help if you're bound by the right limited
resource to begin with. If you're I/O bound it isn't going to help. I would
expect Postgres with its heavy demand on memory bandwidth and shared memory
could potentially benefit more than usual from being able to context switch
during pipeline stalls.
 

All true.   I'd be surprised if HT on an older 2.8 GHz Xeon with only a 
512K cache will see any real benefit.   The dual Xeon is already memory 
starved, now further increase the memory pressure on the caches (because 
the 512K is now shared by two virtual processors) and you probably 
won't see a gain.  It's memory stalls all around.  To be clear, the 
context switch in this case isn't a kernel context switch but a virtual 
cpu context switch.

The probable reason we see dual Opteron boxes way outperforming dual 
Xeons boxes is exactly because of Postgresql's heavy demand on memory.  
The Opteron's have a much better memory system.

A quick search on google or digging around in the comp.arch archives 
will provide lots of details.HP's web site has (had?) some 
benchmarks comparing these systems.  HP sells both Xeon and Opteron 
systems, so the comparisons were quite fair.  Their numbers showed the 
Opteron handily outperforming the Xeons.

-- Alan


Re: [PERFORM] Excessive context switching on SMP Xeons

2004-10-06 Thread Alan Stange
Here's a few numbers from the Opteron 250.  If I get some time I'll post 
a more comprehensive comparison including some other systems.

The system is a Sun v20z.  Dual Opteron 250, 2.4Ghz, Linux 2.6, 8 GB 
memory.   I did a compile and install of pg 8.0 beta 3.  I created a 
data base on a tmpfs file system and ran pgbench.  Everything was out 
of the box, meaning I did not tweak any config files.

I used this for pgbench:
$ pgbench -i -s 32
and this for pgbench invocations:
$ pgbench -s 32 -c 1 -t 1 -v
clients   tps
1        1290
2        1780
4        1760
8        1680
16       1376
32        904

How are these results useful?  In some sense, this is a "speed of light" 
number for the Opteron 250.   You'll never go faster on this system with 
a real storage subsystem involved instead of a tmpfs file system.   It's 
also a set of numbers that anyone else can reproduce as we don't have to 
deal with any differences in file systems, disk subsystems, networking, 
etc.   Finally, it's a set of results that anyone else can compute on 
Xeons or other systems and make simple (and naive) comparisons.

Just to stay on topic:   vmstat reported about 30K cs / second while 
this was running the 1 and 2 client cases.

-- Alan


Re: [PERFORM] Excessive context switching on SMP Xeons

2004-10-05 Thread Alan Stange
A few quick random observations on the Xeon v. Opteron comparison:
-  running a dual Xeon with hyperthreading turned on really isn't the 
same as having a quad cpu system.   I haven't seen postgresql specific 
benchmarks, but the general case has been that HT is a benefit in a few 
particular work loads but with no benefit in general.

- We're running postgresql 8 (in production!) on a dual Opteron 250, 
Linux 2.6, 8GB memory, 1.7TB of attached fiber channel disk, etc.   This 
machine is fast.   A dual 2.8 GHz Xeon with 512K caches (with or 
without HT enabled) simply won't be in the same performance league as 
this dual Opteron system (assuming identical disk systems, etc).  We run 
a Linux 2.6 kernel because it scales under load so much better than the 
2.4 kernels.

The units we're using (and we have a lot of them) are SunFire v20z.  You 
can get a dualie Opteron 250 for $7K with 4GB memory from Sun.  My 
personal experience with this setup in a mission critical config is to 
not depend on 4 hour spare parts, but to spend the money and install the 
spare in the rack.   Naturally, one can go cheaper with slower cpus, 
different vendors, etc.

I don't care to go into the whole debate of Xeon v. Opteron here.   We 
also have a lot of dual Xeon systems. In every comparison I've done with 
our codes, the dual Opteron clearly outperforms the dual Xeon, when 
running on one and both cpus.

-- Alan

Josh Berkus wrote:
Bill,
 

I'd be thrilled to test it too, if for no other reason that to determine
whether what I'm experiencing really is the CS problem.
   

Hmmm ... Gavin's patch is built against 8.0, and any version of the patch 
would require linux 2.6, probably 2.6.7 minimum.   Can you test on that linux 
version?   Do you have the resources to back-port Gavin's patch?   

 

Fair enough. I never see nearly this much context switching on my dual
Xeon boxes running dozens (sometimes hundreds) of concurrent apache
processes, but I'll concede this could just be due to the more parallel
nature of a bunch of independent apache workers.
   

Certainly could be.  Heavy CSes only happen when you have a number of 
long-running processes with contention for RAM in my experience.  If Apache 
is dispatching thing quickly enough, they'd never arise.

 

Hence my desire for recommendations on alternate architectures ;-)
   

Well, you could certainly stay on Xeon if there's better support availability.  
Just get off Dell *650's.   

 

Being a 24x7x365 shop, and these servers being mission critical, I
require vendors that can offer 24x7 4-hour part replacement, like Dell
or IBM. I haven't seen 4-way 64-bit boxes meeting that requirement for
less than $20,000, and that's for a very minimally configured box. A
suitably configured pair will likely end up costing $50,000 or more. I
would like to avoid an unexpected expense of that size, unless there's
no other good alternative. That said, I'm all ears for a cheaper
alternative that meets my support and performance requirements.
   

No, you're going to pay through the nose for that support level.   It's how 
things work.

 

tps = 369.717832 (including connections establishing)
tps = 370.852058 (excluding connections establishing)
   

Doesn't seem too bad to me.   Have anything to compare it to?
What's in your postgresql.conf?
--Josh
 




Re: [PERFORM] Anyone familiar with Apple Xserve RAID

2004-08-26 Thread Alan Stange
Doug McNaught wrote:
Kevin Barnard [EMAIL PROTECTED] writes:
 

  Actually you are both are right and wrong.  The XRaid uses
  FibreChannel to communicate to the host machine(s).  The Raid
  controller is a FibreChannel controller.  After that there is a
  FibreChannel to UltraATA conversion for each drive, separate ATA bus
  for each drive.
  What I am curious about is if this setup gets around ATA fsync
  problems, where the drive reports the write before it is actually
  performed.
   

Good point.
(a) The FC-ATA unit hopefully has a battery-backed cache, which
   would make the whole thing more robust against power loss.
(b) Since Apple is the vendor for the drive units, they can buy ATA
   drives that don't lie about cache flushes.  Whether they do or not
   is definitely a question.  ;)
 

FYI:  http://developer.apple.com/technotes/tn/pdf/tn1040.pdf, a tech 
note on write cache flushing.

A bit dated now, but perhaps some other tech note from Apple has more 
recent information.

-- Alan


Re: [PERFORM] linux distro for better pg performance

2004-05-03 Thread Alan Stange
Joseph Shraibman wrote:
J. Andrew Rogers wrote:
Do these features make a difference?  Far more than you would 
imagine. On one postgres server I just upgraded, we went from a 3Ware 
8x7200-RPM
RAID-10 configuration to an LSI 320-2 SCSI 3x10k RAID-5, with 256M
Is raid 5 much faster than raid 10?  On a 4 disk array with 3 data 
disks and 1 parity disk, you have to write 4/3rds the original data, 
while on raid 10 you have to write 2 times the original data, so 
logically raid 5 should be faster. 
I think this comparison is a bit simplistic.   For example, most raid5 
setups have full stripes that are more than 8K  (the typical IO size in 
postgresql), so one might have to read in portions of the stripe in 
order to compute the parity.   The needed bits might be in some disk or 
controller cache;  if it's not then you lose.   If one is able to 
perform full stripe writes then the raid5 config should be faster for 
writes.

Note also that the mirror has 2 copies of the data, so that the read IOs 
would be divided across 2 (or more) spindles using round robin or a more 
advanced algorithm to reduce seek times. 

Of course, I might be completely wrong...
-- Alan


[PERFORM] vacuum performance

2004-03-18 Thread Alan Stange
Hello all,

I have a question/observation about vacuum performance.  I'm running 
Solaris 9, pg 7.4.1.
The process in questions is doing a vacuum:

bash-2.05$ /usr/ucb/ps auxww | grep 4885
fiasco4885 19.1  3.7605896592920 ?O 19:29:44 91:38 postgres: 
fiasco fiasco [local] VACUUM

I do a truss on the process and see the output below looping over and 
over.   Note the constant opening and closing of the file 42064889.3.

Why the open/close cycle as opposed to caching the file descriptor 
somewhere?

If PG really does need to loop like this, it should be much faster to 
set the cwd and then open without the path in the file name.  You're 
forcing the kernel to do a lot of work walking the path, checking for 
nfs mounts, symlinks, etc.

Thanks!

-- Alan

open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F6E000, SEEK_SET)= 0x18F6E000
write(47, \0\0\0 zA9A3D9E8\0\0\0 .., 8192)   = 8192
close(47)   = 0
read(29, \0\0\0 }ED WF1B0\0\0\0 $.., 8192)= 8192
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F78000, SEEK_SET)= 0x18F78000
write(47, \0\0\0 zA9AC 090\0\0\0 .., 8192)   = 8192
close(47)   = 0
llseek(43, 0x26202000, SEEK_SET)= 0x26202000
read(43, \0\0\084 EC9FC P\0\0\0 ).., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F62000, SEEK_SET)= 0x18F62000
write(47, \0\0\0 zA9C2\bB8\0\0\0 .., 8192)   = 8192
close(47)   = 0
read(29, \0\0\0 }ED X1210\0\0\0 $.., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18018000, SEEK_SET)= 0x18018000
write(47, \0\0\0 zA997ADB0\0\0\0 .., 8192)   = 8192
close(47)   = 0
llseek(43, 0x2620, SEEK_SET)= 0x2620
read(43, \0\0\084 EC4F5E8\0\0\0 ).., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
llseek(13, 13918208, SEEK_SET)  = 13918208
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F52000, SEEK_SET)= 0x18F52000
write(47, \0\0\0 zABE7 V10\0\0\0 .., 8192)   = 8192
close(47)   = 0
semop(46, 0xFFBFC5D0, 1)= 0
read(29, \0\0\0 }ED X 2 p\0\0\0 $.., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
llseek(43, 0x270DA000, SEEK_SET)= 0x270DA000
write(43, \0\0\087A2E8 #B8\0\0\0 ).., 8192)   = 8192
llseek(43, 0x261FE000, SEEK_SET)= 0x261FE000
read(43, \0\0\084 EC498\0\0\0\0 ).., 8192)= 8192
poll(0xFFBFC100, 0, 10) = 0
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x1804A000, SEEK_SET)= 0x1804A000
write(47, \0\0\0 zAA0F8DE0\0\0\0 .., 8192)   = 8192
close(47)   = 0
read(29, \0\0\0 }ED X RD0\0\0\0 $.., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192)   = 8192
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47


Re: [PERFORM] Insert Times

2004-01-27 Thread Alan Stange
PC Drew wrote:

I tested this out and saw no improvement:
 

I'd still suspect some class loading issues and HotSpot compilation 
issues are polluting your numbers.   Try using a PreparedStatement on 
another table first in order to make sure that class's bytecode has been 
loaded.   There are some command line options to the JVM to have it 
print out some status info when it is loading classes and compiling 
methods; you might want to turn on those options as well.

-- Alan

EXPLAIN ANALYZE SELECT * FROM one;
Seq Scan on one  (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.50 
rows=51 loops=1)
Total runtime: 0.75 msec
EXPLAIN ANALYZE SELECT * FROM one;
Seq Scan on one  (cost=0.00..20.00 rows=1000 width=404) (actual time=0.06..0.50 
rows=51 loops=1)
Total runtime: 0.64 msec
EXPLAIN ANALYZE SELECT * FROM one;
Seq Scan on one  (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.40 
rows=51 loops=1)
Total runtime: 0.54 msec
EXPLAIN ANALYZE SELECT * FROM one;
Seq Scan on one  (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.41 
rows=51 loops=1)
Total runtime: 0.54 msec
EXPLAIN ANALYZE SELECT * FROM one;
Seq Scan on one  (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.41 
rows=51 loops=1)
Total runtime: 0.53 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.02 rows=1 loops=1)
Total runtime: 0.85 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.02..0.02 rows=1 loops=1)
Total runtime: 0.15 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.02..0.02 rows=1 loops=1)
Total runtime: 0.14 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.02..0.02 rows=1 loops=1)
Total runtime: 0.12 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.02 rows=1 loops=1)
Total runtime: 0.12 msec


-Original Message-
From:   Leeuw van der, Tim [mailto:[EMAIL PROTECTED]
Sent:   Tue 1/27/2004 12:38 AM
To: PC Drew; [EMAIL PROTECTED]
Cc: 
Subject:RE: [PERFORM] Insert Times
Hi,

My personal feeling on this is, that the long time taken for the first query
is for loading all sorts of libraries, JVM startup overhead etc.
What if you first do some SELECT (whatever), on a different table, to warm
up the JVM and the database?
regards,

--Tim




 


