Re: [PERFORM] 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Gregory Stark wrote: A minute ago I said: AFAIK Opensolaris doesn't implement posix_fadvise() so there's no benefit. It would be great to hear if you could catch the ear of the right people to get an implementation committed. Depending on how the i/o scheduler system is written it might not even be hard -- the Linux implementation of WILLNEED is all of 20 lines. I noticed after sending it that that's slightly unfair. The 20-line function calls another function (which calls another function) to do the real readahead work. That function (mm/readahead.c:__do_page_cache_readahead()) is 48 lines.

It's implemented. I'm guessing it's not what you want to see though: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c
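For readers who haven't used the interface, here is a minimal sketch of the calling side. The helper name is invented for illustration; on platforms where posix_fadvise() is only a stub (as the OpenSolaris source above suggests), the call returns success without starting any I/O.

    /* Hint the kernel to begin asynchronous readahead on a file range. */
    #define _XOPEN_SOURCE 600
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>

    int prefetch_range(int fd, off_t offset, off_t len)
    {
        /* POSIX_FADV_WILLNEED asks for readahead of the given range;
           the call returns 0 on success, an error number otherwise. */
        int rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
        if (rc != 0)
            fprintf(stderr, "posix_fadvise: error %d\n", rc);
        return rc;
    }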
Re: [PERFORM] PostgreSQL and Ultrasparc T1
Jignesh K. Shah wrote: I guess it depends on what you term as your metric for measurement. If it is just one query execution time .. It may not be the best on UltraSPARC T1. But if you have more than 8 complex queries running simultaneously, UltraSPARC T1 can do comparatively well, provided the application can scale along with it.

I just want to clarify one issue here. It's my understanding that the 8-core, 4 hardware thread (known as strands) system is seen as a 32 cpu system by Solaris. So, one could have up to 32 postgresql processes running in parallel on the current systems (assuming the application can scale).

-- Alan
Re: [PERFORM] PostgreSQL and Ultrasparc T1
David Lang wrote: On Tue, 20 Dec 2005, Alan Stange wrote: Jignesh K. Shah wrote: I guess it depends on what you term as your metric for measurement. If it is just one query execution time .. It may not be the best on UltraSPARC T1. But if you have more than 8 complex queries running simultaneously, UltraSPARC T1 can do comparatively well, provided the application can scale along with it. I just want to clarify one issue here. It's my understanding that the 8-core, 4 hardware thread (known as strands) system is seen as a 32 cpu system by Solaris. So, one could have up to 32 postgresql processes running in parallel on the current systems (assuming the application can scale).

note that like hyperthreading, the strands aren't full processors; their efficiency depends on how much the other threads sharing the core stall waiting for external things.

Exactly.
Re: [PERFORM] postgresql performance tuning
Vivek Khera wrote: On Dec 6, 2005, at 11:14 AM, Ameet Kini wrote: need for vacuums. However, it'd be great if there was a similar automatic reindex utility, like say, a pg_autoreindex daemon. Are there any plans for this feature? If not, then would cron scripts be the next best what evidence do you have that you are suffering index bloat? or are you just looking for solutions to problems that don't exist as an academic exercise? :-)

The files for the two indices on a single table used 7.8GB of space before a reindex, and 4.4GB after. The table had been reindexed over the weekend and a vacuum was completed on the table about 2 hours ago. The two indices are now 3.4GB smaller. I don't think this counts as bloat, because of our use case. Even so, we reindex our whole database every weekend.

-- Alan
Re: [PERFORM] postgresql performance tuning
Tom Lane wrote: Alan Stange [EMAIL PROTECTED] writes: Vivek Khera wrote: what evidence do you have that you are suffering index bloat? The files for the two indices on a single table used 7.8GB of space before a reindex, and 4.4GB after. That's not bloat ... that's pretty nearly in line with the normal expectation for a btree index, which is about 2/3rds fill factor. If the compacted index were 10X smaller then I'd agree that you have a bloat problem.

I wrote "I don't think this counts as bloat." I still don't.

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Why not contribute something - put up proof of your stated 8KB versus 32KB page size improvement.

I did observe that 32KB block sizes were a significant win for our usage patterns. It might be a win for any of the following reasons:

0) The preliminaries: ~300GB database with about ~50GB daily turnover. Our data is fairly reasonably grouped. If we're getting one item on a page we're usually looking at the other items as well.

1) We can live with a smaller FSM size. We were often leaking pages with a 10M page FSM setting. With 32K pages, a 10M FSM size is sufficient. Yes, the solution to this is to run vacuum more often, but when the vacuum was taking 10 hours at a time, that was hard to do.

2) The typical datum size in our largest table is about 2.8KB, which is more than 1/4 page size, thus resulting in the use of a toast table. Switching to 32KB pages allows us to get a decent storage of this data into the main tables, thus avoiding another table and associated large index. Not having the extra index in memory for a table with 90M rows is probably beneficial.

3) Vacuum time has been substantially reduced. Vacuum analyze now runs in the 2 to 3 hour range depending on load.

4) Less cpu time spent in the kernel. We're basically doing 1/4 as many system calls.

Overall the system has now been working well. We used to see the database being a bottleneck at times, but now it's keeping up nicely. Hope this helps.

Happy Thanksgiving! -- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke,

- XFS will probably generate better data rates with larger files. You really need to use the same file size as does postgresql. Why compare the speed of reading a 16G file to the speed of reading a 1G file? They won't be the same. If need be, write some code that does the test or modify lmdd to read a sequence of 1G files. Will this make a difference? You don't know until you do it. Any time you cross a couple of 2^ powers in computing, you should expect some differences.

- You did umount the file system before reading the 16G file back in? Because if you didn't then your read numbers are possibly garbage. When the read began, 8G of the file was in memory. You'd be very naive to think that somehow the read of the first 8GB somehow flushed that cached data out of memory. After all, why would the kernel flush pages from file X when you're in the middle of a sequential read of... file X? I'm not sure how Linux handles this, but Solaris would've found the 8G still in memory.

- What was the hardware and disk configuration on which these numbers were generated? For example, if you have a U320 controller, how did the read rate become larger than 320MB/s?

- How did the results change from before? Just posting the new results is misleading given all the boasting we've had to read about your past results.

- There are two results below for writing to ext2: one at 209 MB/s and one at 113MB/s. Why are they different?

- What was the cpu usage during these tests? We see postgresql doing 200+MB/s of IO. You've claimed many times that the machine would be compute bound at lower IO rates, so how much idle time does the cpu still have?

- You wrote: "We'll do a 16GB table size to ensure that we aren't reading from the read cache." Do you really believe that? You have to umount the file system before each test to ensure you're really measuring the disk IO rate. If I'm reading your results correctly, it looks like you have three results for ext and xfs, each of which is faster than the prior one. If I'm reading this correctly, then it looks like one is clearly reading from the read cache.

- Gee, it's so nice of you to drop your 120MB/s observation. I guess my reading at 300MB/s wasn't convincing enough. Yeah, I think it was the cpus too...

- I wouldn't focus on the flat 64% of the data rate number. It'll probably be different on other systems.

I'm all for testing and testing. It seems you still cut a corner without umounting the file system first. Maybe I'm a little too old school on this, but I wouldn't spend a dime until you've done the measurements correctly.

Good Luck.

-- Alan

Luke Lonergan wrote: Alan, Looks like Postgres gets sensible scan rate scaling as the filesystem speed increases, as shown below. I'll drop my 120MB/s observation - perhaps CPUs got faster since I last tested this. The scaling looks like 64% of the I/O subsystem speed is available to the executor - so as the I/O subsystem increases in scan rate, so does Postgres' executor scan speed. So that leaves the question - why not more than 64% of the I/O scan rate? And why is it a flat 64% as the I/O subsystem increases in speed from 333-400MB/s?

- Luke

= Results ===

Unless noted otherwise all results posted are for block device readahead set to 16M using blockdev --setra=16384 block_device. All are using the 2.6.9-11 Centos 4.1 kernel.
For those who don't have lmdd, here is a comparison of two results on an ext2 filesystem:

[EMAIL PROTECTED] dbfast1]# time bash -c "(dd if=/dev/zero of=/dbfast1/bigfile bs=8k count=800000 && sync)"
800000+0 records in
800000+0 records out

real    0m33.057s
user    0m0.116s
sys     0m13.577s

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=800000 sync=1
6553.6000 MB in 31.2957 secs, 209.4092 MB/sec

real    0m33.032s
user    0m0.087s
sys     0m13.129s

So lmdd with sync=1 is equivalent to a sync after a dd. I use 2x memory with dd for the *READ* performance testing, but let's make sure things are synced on both write and read for this set of comparisons.

First, let's test ext2 versus ext3, data=ordered, versus xfs:

16GB write, then read

--- ext2: ---
[EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=2000000 sync=1
16384.0000 MB in 144.2670 secs, 113.5672 MB/sec

[EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=2000000 sync=1
16384.0000 MB in 49.3766 secs, 331.8170 MB/sec

--- ext3, data=ordered:
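Along the lines of the suggestion above to write code that reads a sequence of 1G files (matching how postgresql segments large tables), a rough sketch of such a test might look like the following. The segment names, segment count, and 32KB read size are assumptions for illustration, not anyone's actual harness; remember to umount the file system first.

    /* Time a sequential read over a series of 1GB segment files
       (seg.0, seg.1, ...) and report the aggregate rate. */
    #define _POSIX_C_SOURCE 199309L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        char name[64];
        static char buf[32 * 1024];             /* 32KB reads, as discussed */
        long long total = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int seg = 0; seg < 16; seg++) {    /* 16 x 1GB = 16GB */
            snprintf(name, sizeof name, "seg.%d", seg);
            int fd = open(name, O_RDONLY);
            if (fd < 0) { perror(name); exit(1); }
            ssize_t n;
            while ((n = read(fd, buf, sizeof buf)) > 0)
                total += n;
            close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%lld bytes in %.2f s = %.1f MB/s\n", total, secs, total / secs / 1e6);
        return 0;
    }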
Re: [PERFORM] Hardware/OS recommendations for large databases (
Bruce Momjian wrote: Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: The point you're making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available... even when the cpu is very idle. How can that make sense? Well I think what he's saying is that Postgres is issuing a read, then waiting for the data to return. Then it does some processing, and goes back to issue another read. The CPU is idle half the time because Postgres isn't capable of doing any work while waiting for i/o, and the i/o system is idle half the time while the CPU intensive part happens. (Consider as a pathological example a program that reads 8k then sleeps for 10ms, and loops doing that 1,000 times. Now consider the same program optimized to read 8M asynchronously and sleep for 10s. By the time it's finished sleeping it has probably read in all 8M. Whereas the program that read 8k in little chunks interleaved with small sleeps would probably take twice as long and appear to be entirely i/o-bound with 50% iowait and 50% idle.) It's a reasonable theory and it's not inconsistent with the results you sent. But it's not exactly proven either. Nor is it clear how to improve matters. Adding additional threads to handle the i/o adds an enormous amount of complexity and creates lots of opportunity for other contention that could easily eat all of the gains. Perfect summary. We have a background writer now. Ideally we would have a background reader, that reads-ahead blocks into the buffer cache. The problem is that while there is a relatively long time between a buffer being dirtied and the time it must be on disk (checkpoint time), the read-ahead time is much shorter, requiring some kind of quick "create a thread" approach that could easily bog us down as outlined above. Right now the file system will do read-ahead for a heap scan (but not an index scan), but even then, there is time required to get that kernel block into the PostgreSQL shared buffers, backing up Luke's observation of heavy memcpy() usage. So what are our options? mmap()? I have no idea. Seems larger page size does help.

For sequential scans, you do have a background reader. It's the kernel. As long as you don't issue a seek() between read() calls, the kernel will get the hint about sequential IO and begin to perform a read ahead for you. This is where the above analysis isn't quite right: while postgresql is processing the returned data from the read() call, the kernel has also issued reads as part of the read ahead, keeping the device busy while the cpu is busy. (I'm assuming these details for Linux; Solaris/UFS does work this way.) Issue one seek on the file and the read ahead algorithm will back off for a while. This was my point about some descriptions of how the system works not being sensible.

If your goal is sequential IO, then one must use larger block sizes. No one would use 8KB IO for achieving high sequential IO rates. Simply put, read() is about the slowest way to get 8KB of data. Switching to 32KB blocks reduces all the system call overhead by a large margin. Larger blocks would be better still, up to the stripe size of your mirror. (Of course, you're using a mirror and not raid5 if you care about performance.)

I don't think the memcpy of data from the kernel to userspace is that big of an issue right now.
dd and all the high end network interfaces manage OK doing it, so I'd expect postgresql to do all right with it too. Direct IO will avoid that memcpy, but then you also don't get any caching of the files in memory. I'd be more concerned about any memcpy calls or general data management within postgresql. Does postgresql use the platform-specific memcpy() in libc? Some care might be needed to ensure that the memory blocks within postgresql are all properly aligned to make sure that one isn't ping-ponging cache lines around (usually done by padding the buffer sizes by an extra 32 bytes or L1 line size).

Whatever you do, all the usual high performance computing tricks should be used prior to considering any rewriting of major code sections. Personally, I'd like to see some detailed profiling being done using hardware counters for cpu cycles and cache misses, etc. Given the poor quality of work that has been discussed here in this thread, I don't have much confidence in any other additional results at this time. None of the analysis would be acceptable in any environment in which I've worked. Be sure to take a look at Sun's free Workshop tools as they are excellent for this sort of profiling and one doesn't need to recompile to use them. If I get a little time in the next week or two I might take a crack at this.

Cheers, -- Alan
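On the cache-line point above, the padding/alignment idea can be sketched as follows. The 64-byte line size is an assumption (it varies by processor), and the function is purely illustrative rather than anything postgresql does today.

    /* Allocate an I/O buffer aligned to, and padded out to, whole
       cache lines so two buffers never share a line. */
    #define _XOPEN_SOURCE 600
    #include <stdlib.h>

    #define CACHE_LINE 64
    #define BLCKSZ (32 * 1024)

    void *alloc_aligned_block(void)
    {
        void *p = NULL;
        /* round the allocation up to a whole number of cache lines */
        size_t sz = (BLCKSZ + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
        if (posix_memalign(&p, CACHE_LINE, sz) != 0)
            return NULL;
        return p;
    }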
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: OK - slower this time: We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but which all are capped at 120MB/s when doing sequential scans with different versions of Postgres. Postgresql issues the exact same sequence of read() calls as does dd. So why is dd so much faster?

I'd be careful with the dd read of a 16GB file on an 8GB system. Make sure you umount the file system first, to make sure all of the file is flushed from memory. Some systems use a freebehind on sequential reads to avoid flushing memory... and you'd find that 1/2 of your 16GB file is still in memory. The same point also holds for the writes: when dd finishes not all the data is on disk. You need to issue a sync() call to make that happen. Use lmdd to ensure that the data is actually all written. In other words, I think your dd results are possibly misleading.

It's trivial to demonstrate:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real    0m13.780s
user    0m0.134s
sys     0m13.510s

Oops. I just wrote 470MB/s to a file system that has a peak write speed of 200MB/s. Now, you might say that you wrote a 16GB file on an 8GB machine so this isn't an issue. It does make your dd numbers look fast as some of the data will be unwritten.

I'd also suggest running dd on the same files as postgresql. I suspect you'd find that the layout of the postgresql files isn't that good as they are grown bit by bit, unlike the file created by simply dd'ing a large file.

Understand my point: It doesn't matter that there is idle or iowait on the CPU, the postgres executor is not able to drive the I/O rate for two reasons: there is a lot of CPU used for the scan (the 40% you reported) and a lack of asynchrony (the iowait time). That means that by speeding up the CPU you only reduce the first part, but you don't fix the second and v.v. With more aggressive readahead, the second problem (the I/O asynchrony) is handled better by the Linux kernel and filesystem. That's what we're seeing with XFS.

I think your point doesn't hold up. Every time you make it, I come away posting another result showing it to be incorrect. The point you're making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available... even when the cpu is very idle. How can that make sense? The issue here is that the storage system is very active doing reads on the files... which might be somewhat poorly allocated on disk because postgresql grows the tables bit by bit.

I had the same readahead in Reiser and in XFS. The XFS performance was better because XFS does a better job of large file allocation on disk, thus resulting in many fewer seeks (generated by the file system itself) to read the files back in. As an example, some file systems like UFS purposely scatter large files across cylinder groups to avoid forcing large seeks on small files; one can tune this behavior so that large files are more tightly allocated.

Of course, because this is engineering, I have another obligatory data point: This time it's a 4.2GB table using 137,138 32KB pages with nearly 41 million rows. A select count(1) on the table completes in 14.6 seconds, for an average read rate of 320 MB/s. One cpu was idle, the other averaged 32% system time and 68% user time for the 14 second period.
This is on a 2.2Ghz Opteron. A faster cpu would show increased performance as I really am cpu bound finally. Postgresql is clearly able to issue the relevant sequential read() system calls and sink the resulting data without a problem if the file system is capable of providing the data. It can do this up to a speed of ~300MB/s on this class of system. Now it should be fairly simple to tweak the few spots where some excess memory copies are being done and up this result substantially. I hope postgresql is always using the libc memcpy as that's going to be a lot faster than some private routine.

-- Alan
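As an aside on the benchmarking methodology in this exchange: on Linux, one hedged alternative to unmounting between the write and read phases is to ask the kernel to drop a file's cached pages first. This is only a hint, not a guarantee, so umount remains the conclusive method; the helper below is a sketch with an invented name.

    /* Best-effort eviction of a file's pages from the page cache. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    int drop_file_cache(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        fsync(fd);   /* dirty pages can't be dropped; flush them first */
        int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* len 0 = whole file */
        close(fd);
        return rc;
    }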
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke, it's time to back yourself up with some numbers. You're claiming the need for a significant rewrite of portions of postgresql and you haven't done the work to make that case. You've apparently made some mistakes on the use of dd to benchmark a storage system. Use lmdd and umount the file system before the read and post your results. Using a file 2x the size of memory doesn't work correctly. You can quote any other numbers you want, but until you use lmdd correctly you should be ignored. Ideally, since postgresql uses 1GB files, you'll want to use 1GB files for dd as well.

Luke Lonergan wrote: Alan, On 11/21/05 6:57 AM, Alan Stange [EMAIL PROTECTED] wrote:

$ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=800000
800000+0 records in
800000+0 records out

real    0m13.780s
user    0m0.134s
sys     0m13.510s

Oops. I just wrote 470MB/s to a file system that has a peak write speed of 200MB/s.

How much RAM on this machine?

Doesn't matter. The result will always be wrong without a call to sync() or fsync() before the close() if you're trying to measure the speed of the disk subsystem. Add that sync() and the result will be correct for any memory size. Just for completeness: Solaris implicitly calls sync() as part of close. Bonnie used to get this wrong, so quoting Bonnie isn't any good. Note that on some systems using 2x memory for these tests is almost OK. For example, Solaris used to have a hiwater mark that would throttle processes and not allow more than a few 100K of writes to be outstanding on a file. Linux/XFS clearly allows a lot of write data to be outstanding. It's best to understand the tools and know what they do and why they can be wrong than simply quoting some other tool that makes the same mistakes.

I find that postgresql is able to achieve about 175MB/s on average from a system capable of delivering 200MB/s peak and it does this with a lot of cpu time to spare. Maybe dd can do a little better and deliver 185MB/s. If I were to double the speed of my IO system, I might find that a single postgresql instance can sink about 300MB/s of data (based on the last numbers I posted). That's why I have multi-cpu opterons and more than one query/client as they soak up the remaining IO capacity.

It is guaranteed that postgresql will hit some threshold of performance in the future and possible rewrites of some core functionality will be needed, but no numbers posted here so far have made the case that postgresql is in trouble now. In the mean time, build balanced systems with cpus that match the capabilities of the storage subsystems, use 32KB block sizes for large memory databases that are doing lots of sequential scans, use file systems tuned for large files, use opterons, etc.

As always, one has to post some numbers. Here's an example of how dd doesn't do what you might expect:

mite02:~ # lmdd if=internal of=/fidb2/bigfile bs=8k count=2k
16.7772 MB in 0.0235 secs, 714.5931 MB/sec

mite02:~ # lmdd if=internal of=/fidb2/bigfile bs=8k count=2k sync=1
16.7772 MB in 0.1410 secs, 118.9696 MB/sec

Both numbers are correct. But one measures the kernel's ability to absorb 2000 8KB writes with no guarantee that the data is on disk and the second measures the disk subsystem's ability to write 16MB of data. dd is equivalent to the first result. You can't use the first type of result and complain that postgresql is slow. If you wrote 16G of data on a machine with 8G memory then your dd result is possibly too fast by a factor of two as 8G of the data might not be on disk yet. We won't know until you post some results.
Cheers, -- Alan
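For completeness, here is a small sketch of what lmdd's sync=1 amounts to: the clock stops only after fsync() returns, so the computed rate reflects the disk subsystem rather than the kernel's ability to buffer writes. The file name and sizes are illustrative.

    /* Write ~6.5GB and stop timing only after the data is on disk. */
    #define _POSIX_C_SOURCE 199309L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t bs = 8192;
        const long count = 800000;              /* 800000 x 8KB */
        char *buf = calloc(1, bs);
        int fd = open("bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || buf == NULL) { perror("setup"); exit(1); }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < count; i++)
            if (write(fd, buf, bs) != (ssize_t)bs) { perror("write"); exit(1); }
        fsync(fd);                              /* the step plain dd omits */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double mb = (double)count * bs / 1e6;
        printf("%.1f MB in %.2f s = %.1f MB/s\n", mb, secs, mb / secs);
        return 0;
    }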
Re: [PERFORM] Hardware/OS recommendations for large databases (
William Yu wrote: Alan Stange wrote: Luke Lonergan wrote: The "aka iowait" is the problem here - iowait is not idle (otherwise it would be in the idle column). Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well.

I'm sure the theory is nice but here's my experience with iowait just a minute ago. I run Linux/XFce as my desktop -- decided I wanted to lookup some stuff in Wikipedia under Mozilla and my computer system became completely unusable for nearly a minute while who knows what Mozilla was doing. (Probably loading all the language packs.) I could not even switch to IRC (already loaded) to chat with other people while Mozilla was chewing up all my disk I/O. So I went to another computer, connected to mine remotely (slow...) and checked top. 90% in the "wa" column which I assume is the iowait column. It may be idle in theory but it's not a very useful idle -- wasn't able to switch to any programs already running, couldn't click on the XFce launchbar to run any new programs.

So, you have a sucky computer. I'm sorry, but iowait is still idle time, whether you believe it or not.

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as predicted, one CPU went 100% busy on the seq scan. During iowait periods, the CPU can be context switched to other users, but as I pointed out earlier, that's not useful for getting response on decision support queries. I don't think that's true. If the syscall was preemptable then it wouldn't show up under iowait, but rather idle. The time spent in iowait is time in uninterruptable sleeps where no other process can be scheduled.

That would be wrong. The time spent in iowait is idle time. The iowait stat would be 0 on a machine with a compute bound runnable process available for each cpu. Come on people, read the man page or look at the source code. Just stop making stuff up. iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well.

Here's the man page:

%iowait Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

If the system had some other cpu bound work to perform you wouldn't ever see any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan using the one set of numbers I posted is misunderstanding the actual metrics.

That's easy to test. Rerun the test with another process running a simple C program like main() {while(1);} (or two invocations of that on your system because of the extra processor). I bet you'll see about half the percentage of iowait because postgres will get half as much opportunity to schedule i/o. If what you are saying were true then you should get 0% iowait.

Yes, I did this once about 10 years ago. But instead of saying "I bet" and guessing at the result, you should try it yourself. Without guessing, I can tell you that the iowait time will go to 0%. You can do this loop in the shell, so there's no code to write. Also, it helps to do this with the shell running at a lower priority.

-- Alan
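For anyone wanting to reproduce this, a complete version of that one-line busy-loop might look like the following; run one instance per cpu (at low priority, as suggested above) and %iowait should fall to zero while the I/O test continues unchanged.

    /* Pure compute load: never blocks, so it soaks up any idle time
       that would otherwise be accounted as iowait. */
    #include <unistd.h>

    int main(void)
    {
        nice(19);       /* stay out of the I/O test's way */
        for (;;)
            ;           /* spin forever */
    }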
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 11:39 AM, Alan Stange [EMAIL PROTECTED] wrote: Yes and no. The one cpu is clearly idle. The second cpu is 40% busy and 60% idle (aka iowait in the above numbers). The "aka iowait" is the problem here - iowait is not idle (otherwise it would be in the idle column). Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as predicted, one CPU went 100% busy on the seq scan. During iowait periods, the CPU can be context switched to other users, but as I pointed out earlier, that's not useful for getting response on decision support queries.

iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well. Here's the man page: %iowait Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. If the system had some other cpu bound work to perform you wouldn't ever see any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan using the one set of numbers I posted is misunderstanding the actual metrics.

Thanks for your data, it exemplifies many of the points brought up: - Lots of disks and expensive I/O hardware does not help improve performance on large table queries because I/O bandwidth does not scale beyond 110-120MB/s on the fastest CPUs

I don't think that is the conclusion from the anecdotal numbers I posted. This file subsystem doesn't perform as well as expected for any tool. Bonnie, dd, star, etc., don't get a better data rate either. In fact, the storage system wasn't built for performance; it was built to reliably hold a big chunk of data. Even so, postgresql is reading at 130MB/s on it, using about 30% of a single cpu, almost all of which was system time. I would get the same 130MB/s on a system with cpus that were substantially slower; the limitation isn't the cpus, or postgresql. It's the IO system that is poorly configured for this test, not postgresql's ability to use it.

In fact, given the numbers I posted, it's clear this system could handily generate more than 120 MB/s using a single cpu given a better IO subsystem; it has cpu time to spare. A simple test can be done: build the database in /dev/shm and time the scans. It's the same read() system call being used and now one has made the IO system infinitely fast. The claim is being made that standard postgresql is unable to generate more than 120MB/s of IO on any IO system due to an inefficient use of the kernel API and excessive memory copies, etc. Having the database be on a ram based file system is an example of "expensive IO hardware" and all else would be the same. Hmmm, now that I think about this, I could throw a medium sized table onto /dev/shm using tablespaces on one of our 8GB linux boxes. So why is this experiment not valid, or what is it about the above assertion that I am missing?

Anyway, if one cares about high speed sequential IO, then one should use a much larger block size to start. Using 8KB IOs is inappropriate for such a configuration. We happen to be using 32KB blocks on our largest database and it's been the best move for us.

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Another data point.

We had some down time on our system today to complete some maintenance work. We took the opportunity to rebuild the 700GB file system using XFS instead of Reiser. One iostat output for 30 seconds is

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           1.58    0.00   19.69    31.94   46.78

Device:     tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd      343.73    175035.73       277.55    5251072       8326

while doing a select count(1) on the same large table as before. Subsequent iostat output all showed that this data rate was being maintained. The system is otherwise mostly idle during this measurement. The sequential read rate is 175MB/s. The system is the same as earlier, one cpu is idle and the second is ~40% busy doing the scan and ~60% idle. This is postgresql 8.1rc1, 32KB block size. No tuning except for using a 1024KB read ahead.

The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel controller). I see no reason why this configuration wouldn't generate higher IO rates if a faster IO connection were available. Can you explain again why you think there's an IO ceiling of 120MB/s because I really don't understand?

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 5:41 AM, Alan Stange [EMAIL PROTECTED] wrote: That's interesting, as I occasionally see more than 110MB/s of postgresql IO on our system. I'm using a 32KB block size, which has been a huge win in performance for our usage patterns. 300GB database with a lot of turnover. A vacuum analyze now takes about 3 hours, which is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory, Linux 2.6.11, FC drives. 300GB / 3 hours = 27MB/s.

That's 3 hours under load, with 80 compute clients beating on the database at the same time. We have the stats turned way up, so the analyze tends to read a big chunk of the tables a second time as well. We typically don't have three hours a day of idle time.

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Richard Huxton wrote: Dave Cramer wrote: On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote: Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1. A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Now there's an interesting line drawn in the sand. I presume you have numbers to back this up? This should draw some interesting posts.

That's interesting, as I occasionally see more than 110MB/s of postgresql IO on our system. I'm using a 32KB block size, which has been a huge win in performance for our usage patterns. 300GB database with a lot of turnover. A vacuum analyze now takes about 3 hours, which is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory, Linux 2.6.11, FC drives.

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 8:13 AM, Alan Stange [EMAIL PROTECTED] wrote: I told you in my initial post that I was observing numbers in excess of what you're claiming, but you seemed to think I didn't know how to measure an IO rate. Prove me wrong, post your data. I should note too that our system uses about 20% of a single cpu when performing a table scan at 100MB/s of IO. I think you claimed the system would be cpu bound at this low IO rate. See above.

Here's the output from one iteration of iostat -k 60 while the box is doing a select count(1) on a 238GB table.

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           0.99    0.00   17.97    32.40   48.64

Device:     tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd      345.95    130732.53         0.00    7843952          0

We're reading 130MB/s for a full minute. About 20% of a single cpu was being used. The remainder being idle. We've done nothing fancy and achieved results you claim shouldn't be possible. This is a system that was re-installed yesterday, no tuning was done to the file systems, kernel or storage array.

What am I doing wrong? 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO (for a DOE lab). And now I don't know what I'm doing...

Cheers, -- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote: Here's the output from one iteration of iostat -k 60 while the box is doing a select count(1) on a 238GB table.

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           0.99    0.00   17.97    32.40   48.64

Device:     tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd      345.95    130732.53         0.00    7843952          0

We're reading 130MB/s for a full minute. About 20% of a single cpu was being used. The remainder being idle. Cool - thanks for the results. Is that % of one CPU, or of 2? Was the system otherwise idle?

Actually, this was dual cpu and there was other activity during the full minute, but it was on other file devices, which I didn't include in the above output. Given that, and given what I see on the box now, I'd raise the 20% to 30% just to be more conservative. It's all in the kernel either way; using a different scheduler or file system would change that result. Even better would be using direct IO to not flush everything else from memory and avoid some memory copies from kernel to user space. Note that almost none of the time is user time. Changing postgresql won't change the cpu usage. One IMHO obvious improvement would be to have vacuum and analyze only do direct IO. Now they appear to be very effective memory flushing tools. Table scans on tables larger than say 4x memory should probably also use direct IO for reads.

We've done nothing fancy and achieved results you claim shouldn't be possible. This is a system that was re-installed yesterday, no tuning was done to the file systems, kernel or storage array. Are you happy with 130MB/s? How much did you pay for that? Is it more than $2,000, or double my 2003 PC?

I don't know what the system cost. It was part of a block of dual opterons from Sun that we got some time ago. I think the 130MB/s is slow given the hardware, but it's acceptable. I'm not too price sensitive; I care much more about reliability, uptime, etc.

What am I doing wrong? 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO (for a DOE lab). And now I don't know what I'm doing... Cool. Would that be Sandia? We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on complex queries.

Disk?! 4 StorageTek tape silos. That would be .002 TB/s. One has to change how you think when you have that much data. And hope you don't have a fire, because there's no backup. That work was while I was at BNL. I believe they are now at 4PB of tape and 150TB of disk.

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: I don't know what the system cost. It was part of a block of dual opterons from Sun that we got some time ago. I think the 130MB/s is slow given the hardware, but it's acceptable. I'm not too price sensitive; I care much more about reliability, uptime, etc. Then I know what they cost - we have them too (V20z and V40z). You should be getting 400MB/s+ with external RAID.

Yes, but we don't. This is where I would normally begin a rant on how craptacular Linux can be at times. But, for the sake of this discussion, postgresql isn't reading the data any more slowly than does any other program. And we don't have the time to experiment with the box. I know it should be better, but it's good enough for our purposes at this time.

-- Alan
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: Luke Lonergan wrote: Alan, On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote: Here's the output from one iteration of iostat -k 60 while the box is doing a select count(1) on a 238GB table.

avg-cpu:  %user   %nice    %sys  %iowait   %idle
           0.99    0.00   17.97    32.40   48.64

Device:     tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdd      345.95    130732.53         0.00    7843952          0

We're reading 130MB/s for a full minute. About 20% of a single cpu was being used. The remainder being idle. Cool - thanks for the results. Is that % of one CPU, or of 2? Was the system otherwise idle? Actually, this was dual cpu I hate to agree with him but that looks like a dual machine with one CPU pegged. Yes most of the time is being spent in the kernel, but you're still basically cpu limited. That said, 130MB/s is nothing to sneeze at, that's maxing out two high end drives and quite respectable for a 3-disk stripe set, even reasonable for a 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and only getting 130MB/s then it does seem likely the cpu is actually holding you back here. Still it doesn't show Postgres being nearly so CPU wasteful as the original poster claimed.

Yes and no. The one cpu is clearly idle. The second cpu is 40% busy and 60% idle (aka iowait in the above numbers). Of that 40%, other things were happening as well during the 1 minute snapshot. During some iostat outputs that I didn't post the cpu time was ~20%. So, you can take your pick. The single cpu usage is somewhere between 20% and 40%. As I can't remove other users of the system, it's the best measurement that I can make right now. Either way, it's not close to being cpu bound. This is with Opteron 248, 2.2Ghz cpus. Note that the storage system has been a bit disappointing: it's an IBM FAStT600 with a 200MB/s fiber attachment. It could be better, but it's not been the bottleneck in our work, so we haven't put any energy into it.

It's all in the kernel either way; using a different scheduler or file system would change that result. Even better would be using direct IO to not flush everything else from memory and avoid some memory copies from kernel to user space. Note that almost none of the time is user time. Changing postgresql won't change the cpu usage. Well changing to direct i/o would still be changing Postgres so that's unclear. And there are plenty of more mundane ways that Postgres is responsible for how efficiently or not the kernel is used. Just using fewer syscalls to do the same amount of reading would reduce cpu consumption.

Absolutely. This is why we're using a 32KB block size and also switched to using O_SYNC for the WAL syncing method. That's many MB/s that don't need to be cached in the kernel (thus evicting other data), and we avoid all the fsync/fdatasync syscalls. The purpose of direct IO isn't to make the vacuum or analyze faster, but to lessen their impact on queries with someone waiting for the results. That's our biggest hit: running a sequential scan on 240GB of data and flushing everything else out of memory.

Now that I think about this a bit, a big chunk of time is probably being lost in TLB misses and other virtual memory events that would be avoided if a larger page size was being used.

-- Alan
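A sketch of the O_SYNC mechanism mentioned above: opening the log file with O_SYNC folds durability into each write(), so no separate fsync()/fdatasync() call is needed per commit. This illustrates the flag's effect only; it is not postgresql's actual wal_sync_method code, and the file name is invented.

    /* Each write() to this descriptor returns only after the data
       has reached stable storage. */
    #include <fcntl.h>

    int open_wal_osync(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    }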
Re: [PERFORM] Is There Any Way ....
Alex Turner wrote: This is possible with Oracle utilizing the keep pool: alter table t_name storage ( buffer_pool keep ); If Postgres were to implement its own caching system, this seems like it would be easy to implement (beyond the initial caching effort). Alex

On 10/24/05, Craig A. James [EMAIL PROTECTED] wrote: Jim C. Nasby jnasby ( at ) pervasive ( dot ) com wrote: Stefan Weiss wrote: ... IMO it would be useful to have a way to tell PG that some tables were needed frequently, and should be cached if possible. This would allow application developers to consider joins with these tables as cheap, even when querying on columns that are not indexed. Why do you think you'll know better than the database how frequently something is used? At best, your guess will be correct and PostgreSQL (or the kernel) will keep the table in memory. Or, your guess is wrong and you end up wasting memory that could have been used for something else. It would probably be better if you describe why you want to force this table (or tables) into memory, so we can point you at more appropriate solutions.

Or perhaps we could explain why we NEED to force these tables into memory, so we can point you at a more appropriate implementation. ;-)

Ok, wittiness aside, here's a concrete example. I have an application with one critical index that MUST remain in memory at all times. The index's tablespace is about 2 GB. As long as it's in memory, performance is excellent - a user's query takes a fraction of a second. But if it gets swapped out, the user's query might take up to five minutes as the index is re-read from disk.

Now here's the rub. The only performance I care about is response to queries from the web application. Everything else is low priority. But there is other activity going on. Suppose, for example, that I'm updating tables, performing queries, doing administration, etc., etc., for a period of an hour, during which no customer visits the site. Then another customer comes along and performs a query. At this point, no heuristic in the world could have guessed that I DON'T CARE ABOUT PERFORMANCE for anything except my web application. The performance of all the other stuff, the administration, the updates, etc., is utterly irrelevant compared to the performance of the customer's query. What actually happens is that the other activities have swapped out the critical index, and my customer waits, and waits, and waits... and goes away after a minute or two.

To solve this, we've been forced to purchase two computers, and mirror the database on both. All administration and modification happens on the offline database, and the web application only uses the online database. At some point, we swap the two servers, sync the two databases, and carry on. It's a very unsatisfactory solution.

We have a similar problem with vacuum being the equivalent of "continuously flush all system caches for a long time". Our database is about 200GB in size and vacuums take hours and hours. The performance is acceptable still, but only because we've hidden the latency in our application. I've occasionally thought it would be good to have the backend doing a vacuum or analyze also call priocntl() prior to doing any real work to lower its priority. We'll be switching to the 8.1 release ASAP just because the direct IO capabilities are appearing to be a win on our development system.

-- Alan
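The priority idea floated above could be sketched with the portable setpriority() in place of the Solaris-specific priocntl(); a backend about to run a long vacuum or analyze would demote itself. Note this only lowers cpu scheduling priority and does nothing about the cache-flushing I/O itself; the helper is hypothetical.

    /* Lower this process's scheduling priority before bulk work. */
    #include <sys/time.h>
    #include <sys/resource.h>

    void demote_self(void)
    {
        /* which=PRIO_PROCESS, who=0 means "this process";
           10 is a noticeably lower priority. */
        setpriority(PRIO_PROCESS, 0, 10);
    }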
Re: [PERFORM] Performance considerations for very heavy INSERT traffic
Brandon Black wrote: On 9/12/05, *PFC* [EMAIL PROTECTED] wrote: - benchmarking something else than ext3 (xfs ? reiser3 ?) We've had bad experiences under extreme and/or strange workloads with XFS here in general, although this is the first major postgresql project - the rest were with other applications writing to XFS. Bad experiences like XFS filesystems detecting internal inconsistencies at runtime and unmounting themselves from within the kernel module (much to the dismay of applications with open files on the filesystem), on machines with validated good hardware. It has made me leery of using anything other than ext3 for fear of stability problems. Reiser3 might be worth taking a look at though.

Just one tidbit. We tried XFS on a very active system similar to what you describe. Dual opterons, 8GB memory, fiber channel drives, 2.6 kernel, etc. And the reliability was awful. We spent a lot of time making changes one at a time to try and isolate the cause; when we switched out from XFS to ReiserFS our stability problems went away. It may be the case that the XFS problems have all been corrected in newer kernels, but I'm not going to put too much effort into trying that again.

I recently built a postgres with 32KB block sizes and have been doing some testing. For our particular workloads it has been a win.

-- Alan
Re: [PERFORM] difference in plan between 8.0 and 8.1?
Tom Lane wrote: Alan Stange [EMAIL PROTECTED] writes:

Unique  (cost=2717137.08..2771407.21 rows=10854026 width=8)
  ->  Sort  (cost=2717137.08..2744272.14 rows=10854026 width=8)
        Sort Key: timeseriesid
        ->  Bitmap Heap Scan on tbltimeseries  (cost=48714.09..1331000.42 rows=10854026 width=8)
              Recheck Cond: (timeseriesid > 0)
              ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48714.09 rows=10854026 width=0)
                    Index Cond: (timeseriesid > 0)
(7 rows)

I'm hoping someone can explain the new query plan (as I'm not sure I understand what it is doing). The index scan is reading the index to find out which heap tuple IDs (TIDs) the index says meet the condition. It returns a bitmap of the tuple locations (actually, an array of per-page bitmaps). The heap scan goes and fetches the tuples from the table, working in TID order to avoid re-reading the same page many times, as can happen for ordinary index scans. Since the result isn't sorted, we have to do a sort to get it into the correct order for the Unique step. Because it avoids random access to the heap, this plan can be a lot faster than a regular index scan. I'm not sure at all that 8.1 is doing good relative cost estimation yet, though. It would be interesting to see EXPLAIN ANALYZE results for both ways. (You can use enable_bitmapscan and enable_indexscan to force the planner to pick the plan it thinks is slower.)

Just to be clear: the index is on the timeseriesid column. Also, we usually have the where clause with some non-zero number. Anyway, here's the basic query, with variations added on below:

fiasco=# explain analyze select timeseriesid from tbltimeseries where timeseriesid > 0;

                                   QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on tbltimeseries  (cost=48906.82..1332935.19 rows=10905949 width=8) (actual time=16476.337..787480.979 rows=10907853 loops=1)
  Recheck Cond: (timeseriesid > 0)
  ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48906.82 rows=10905949 width=0) (actual time=16443.585..16443.585 rows=10907853 loops=1)
        Index Cond: (timeseriesid > 0)
Total runtime: 791340.341 ms
(5 rows)

Now add the order:

fiasco=# explain analyze select timeseriesid from tbltimeseries where timeseriesid > 0 order by timeseriesid;

                                   QUERY PLAN
------------------------------------------------------------------------------
Sort  (cost=2726087.93..2753352.81 rows=10905949 width=8) (actual time=821090.666..826353.054 rows=10913868 loops=1)
  Sort Key: timeseriesid
  ->  Bitmap Heap Scan on tbltimeseries  (cost=48912.82..1332941.19 rows=10905949 width=8) (actual time=16353.921..757075.349 rows=10913868 loops=1)
        Recheck Cond: (timeseriesid > 0)
        ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48912.82 rows=10905949 width=0) (actual time=16335.239..16335.239 rows=10913868 loops=1)
              Index Cond: (timeseriesid > 0)
Total runtime: 830829.145 ms
(7 rows)

and the distinct:

fiasco=# explain analyze select distinct timeseriesid from tbltimeseries where timeseriesid > 0 order by timeseriesid;

                                   QUERY PLAN
------------------------------------------------------------------------------
Unique  (cost=2726087.93..2780617.68 rows=10905949 width=8) (actual time=816938.970..831119.423 rows=10913868 loops=1)
  ->  Sort  (cost=2726087.93..2753352.81 rows=10905949 width=8) (actual time=816938.967..822298.802 rows=10913868 loops=1)
        Sort Key: timeseriesid
        ->  Bitmap Heap Scan on tbltimeseries  (cost=48912.82..1332941.19 rows=10905949 width=8) (actual time=15866.736..752851.006 rows=10913868 loops=1)
              Recheck Cond: (timeseriesid > 0)
              ->  Bitmap Index Scan on idx_timeseris  (cost=0.00..48912.82 rows=10905949 width=0) (actual time=15852.652..15852.652 rows=10913868 loops=1)
                    Index Cond: (timeseriesid > 0)
Total runtime: 835558.312 ms
(8 rows)

Now the usual query from 8.0:

fiasco=# set enable_bitmapscan=false; explain
analyze select distinct timeseriesid from tbltimeseries where timeseriesid > 0 order by timeseriesid;
SET

                                   QUERY PLAN
------------------------------------------------------------------------------
Unique  (cost=0.00..14971276.10 rows=10905949 width=8) (actual time=24.930..999645.638 rows=10913868 loops=1)
  ->  Index Scan using
Re: [PERFORM] Caching by Postgres
[EMAIL PROTECTED] wrote: So then we move on to what 64-bit is really useful for. Obviously, there is the arithmetic. If you were previously doing 64-bit arithmetic through software, you will notice an immediate speed improvement when doing it through hardware instead. If you have a program that is scanning memory in any way, it may benefit from 64-bit instructions (for example - copying data 64-bit words at a time instead of 32-bit words at a time). PostgreSQL might benefit slightly from either of these, slightly balancing the performance degradation of using more memory to store the pointers, and more memory bandwidth to access the pointers.

At least on Sparc processors, v8 and newer, any double precision math (including longs) is performed with a single instruction, just like for a 32 bit datum. Loads and stores of 8 byte datums are also handled via a single instruction. The urban myth that 64-bit math is different/better on a 64-bit processor is just that; yes, some lower end processors would emulate/trap those instructions but that's an implementation detail, not architecture. I believe that this is all true for other RISC processors as well. The 64-bit API on UltraSparcs does bring along some extra FP registers, IIRC.

If, however, you happen to have a very large amount of physical memory - more memory than is supported by a 32-bit system, but is supported by your 64-bit system, then the operating system should be able to use this additional physical memory to cache file system data pages, which will benefit PostgreSQL if used with tables that are larger than the memory supported by your 32-bit system, and which have queries which require more pages than the memory supported by your 32-bit system to be frequently accessed. If you have a huge database, with many clients accessing the data, this would be a definite yes. With anything less, it is a maybe, or a probably not.

Solaris, at least, provided support for far more than 4GB of physical memory on 32 bit kernels. A newer 64 bit kernel might be more efficient, but that's just because the time was taken to support large page sizes and more efficient data structures. It's nothing intrinsic to a 32 vs 64 bit kernel.

-- Alan
Re: [PERFORM] Read/Write block sizes
Josh Berkus wrote: Steve, I would assume that dbt2 with STP helps minimize the amount of hours someone has to invest to determine performance gains with configurable options? Actually, these I/O operation issues show up mainly with DW workloads, so the STP isn't much use there. If I can ever get some of these machines back from the build people, I'd like to start testing some stuff. One issue with testing this is that currently PostgreSQL doesn't support block sizes above 128K. We've already done testing on that (well, Mark has) and the performance gains aren't even worth the hassle of remembering you're on a different block size (like, +4%).

What size database was this on?

What the Sun people have done with other DB systems is show that substantial performance gains are possible on large databases (>100G) using block sizes of 1MB. I believe that's possible (and that it probably makes more of a difference on Solaris than on BSD) but we can't test it without some hackery first.

We're running on a 100+GB database, with long streams of 8KB reads with the occasional _llseek(). I've been thinking about running with a larger blocksize with the expectation that we'd see fewer system calls and a bit more throughput. read() calls are a very expensive way to get 8KB of memory (that we know is already resident) during scans. One has to trap into the kernel, do the usual process state accounting, find the block, copy the memory to userspace, return back from the kernel to user space reversing all the process accounting, pick out the bytes one needs, and repeat all over again. That's quite a few sacrificial cache lines for 8KB. Yeah, sure, Linux syscalls are fast, but they aren't that fast, and other operating systems (windows and solaris) have a bit more overhead on syscalls.

Regarding large block sizes on Solaris: the Solaris folks can also use large memory pages and avoid a lot of the TLB overhead from the VM system. The various trapstat and cpustat commands can be quite interesting to look at when running any large application on a Solaris system. It should be noted that having a large shared memory segment can be a performance loser just from the standpoint of TLB thrashing. O(GB) memory access patterns can take a huge performance hit in user space with 4K pages compared to the kernel which would be mapping the segmap (in Solaris parlance) with 4MB pages.

Anyway, I guess my point is that the balance between kernel managed vs. postgresql managed buffers isn't obvious at all.

-- Alan
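One way to illustrate the cost being described: map the file once and touch resident pages with ordinary loads, instead of paying the trap/copy/return sequence for every 8KB read(). The helper below is hypothetical and not something postgresql does; it simply shows the alternative access path.

    /* Sum a file's bytes via mmap: no per-block system call or copy. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    long sum_bytes(const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0)
            return -1;
        unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                MAP_SHARED, fd, 0);
        close(fd);                 /* the mapping survives the close */
        if (p == MAP_FAILED)
            return -1;
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];           /* resident pages: plain memory loads */
        munmap(p, (size_t)st.st_size);
        return sum;
    }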
[PERFORM] BG writer question?
Hello all,

I just was running strace in the writer process and I noticed this pattern:

select(0, NULL, NULL, NULL, {0, 200000}) = 0 (Timeout)
getppid() = 4240
time(NULL) = 1123773324
mmap2(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x81000) = 0x69ea3000
semop(1409034, 0xc0bc, 1) = 0
...seeks and writes...
munmap(0x69ea3000, 528384) = 0
select(0, NULL, NULL, NULL, {0, 200000}) = 0 (Timeout)
getppid() = 4240
time(NULL) = 1123773324
mmap2(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0x81000) = 0x69ea3000
semop(1605648, 0xc0bc, 1) = 0
...seeks and writes...
munmap(0x69ea3000, 528384) = 0
select(0, NULL, NULL, NULL, {0, 200000}) = 0 (Timeout)

Why mmap and munmap each time? mmap and munmap are fairly expensive operations (on some systems), especially on multi cpu machines. munmap in particular generally needs to issue cross calls to the other cpus to ensure any page mappings are invalidated.

Just curious. Thanks!

-- Alan
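The pattern the trace suggests avoiding can be sketched like this: map the scratch region once and reuse it across wakeups, rather than paying an mmap2()/munmap() pair (and munmap's cross-cpu TLB invalidations) on every cycle. The size mirrors the trace above; the helper itself is hypothetical.

    /* Lazily create one anonymous scratch mapping and reuse it. */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    #define SCRATCH_SZ 528384   /* the size seen in the strace output */

    static void *scratch;       /* mapped once, reused every cycle */

    void *get_scratch(void)
    {
        if (scratch == NULL) {
            scratch = mmap(NULL, SCRATCH_SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (scratch == MAP_FAILED)
                scratch = NULL;
        }
        return scratch;
    }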
Re: [PERFORM] How to improve db performance with $7K?
Alex Turner wrote: [snip] Adding drives will not let you get lower response times than the average seek time on your drives*. But it will let you reach that response time more often. [snip] I believe your assertion is fundamentally flawed. Adding more drives will not let you reach that response time more often. All drives are required to fill every request in all RAID levels (except possibly 0+1, but that isn't used for enterprise applications). Most requests in OLTP require most of the request time to seek, not to read. Only in single large block data transfers will you get any benefit from adding more drives, which is atypical in most database applications. For most database applications, the only way to increase transactions/sec is to decrease request service time, which is generally achieved with better seek times or a better controller card, or possibly spreading your database across multiple tablespaces on separate partitions. My assertion therefore is that simply adding more drives to an already competent* configuration is about as likely to increase your database effectiveness as swiss cheese is to make your car run faster. Consider the case of a mirrored file system with a mostly read() workload. Typical behavior is to use a round-robin method for issuing the read operations to each mirror in turn, but one can use other methods, like a geometric algorithm that will issue the reads to the drive with the head located closest to the desired track. Some systems have many mirrors of the data for exactly this behavior. In fact, one can carry this logic to the extreme and have one drive for every cylinder in the mirror, thus removing seek latencies completely. In fact this extreme case would also remove the rotational latency, as the cylinder will be in the disk's read cache. :-) Of course, writing data would be a bit slow! I'm not sure I understand your assertion that all drives are required to fill every request in all RAID levels. After all, in mirrored reads only one mirror needs to read any given block of data, so I don't know what goal is achieved in making other mirrors read the same data. My assertion (based on ample personal experience) is that one can *always* get improved performance by adding more drives. Just limit the drives to use the first few cylinders so that the average seek time is greatly reduced, and concatenate the drives together. One can then build the usual RAID device out of these concatenated metadevices. Yes, one is wasting lots of disk space, but that's life. If your goal is performance, then you need to put your money on the table. The system will be somewhat unreliable because of the device count, additional SCSI buses, etc., but that too is life in the high performance world. -- Alan ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
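The geometric read policy described above reduces to something like the following toy C (head positions tracked per mirror in cylinders, with a plain linear seek-distance model; a real scheduler would also weigh rotational position and queue depth):

#include <stdlib.h>

#define NMIRRORS 4

static long head_pos[NMIRRORS];   /* last known cylinder per mirror */

/* Round-robin would just rotate through the mirrors; instead, pick
 * the mirror whose head is closest to the target cylinder. */
static int pick_mirror(long target_cyl)
{
    int best = 0;
    long best_dist = labs(head_pos[0] - target_cyl);
    for (int i = 1; i < NMIRRORS; i++) {
        long d = labs(head_pos[i] - target_cyl);
        if (d < best_dist) { best_dist = d; best = i; }
    }
    head_pos[best] = target_cyl;  /* the head ends up at the target */
    return best;
}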
Re: [PERFORM] How to improve db performance with $7K?
PFC wrote: My argument is that a sufficiently smart kernel scheduler *should* yield performance results that are reasonably close to what you can get with that feature. Perhaps not quite as good, but reasonably close. It shouldn't be an orders-of-magnitude type difference. And a controller card (or drive) has a lot less RAM to use as a cache / queue for reordering stuff than the OS has; potentially the OS can use most of the available RAM, which can be gigabytes on a big server, whereas in the drive there are at most a few tens of megabytes... However all this is a bit looking at the problem through the wrong end. The OS should provide a multi-read call for the applications to pass a list of blocks they'll need, then reorder them and read them the fastest possible way, clustering them with similar requests from other threads. Right now when a thread/process issues a read() it will block until the block is delivered to this thread. The OS does not know if this thread will then need the next block (which can be had very cheaply if you know ahead of time you'll need it) or not. Thus it must make guesses, read ahead (sometimes), etc... All true. Which is why high performance computing folks use aio_read()/aio_write() and load up the kernel with all the requests they expect to make. The kernels that I'm familiar with will do read ahead on files based on some heuristics: when you read the first byte of a file the OS will typically load up several pages of the file (depending on file size, etc). If you continue doing read() calls without a seek() on the file descriptor, the kernel will get the hint that you're doing a sequential read and continue caching up the pages ahead of time, usually using the pages you just read to hold the new data so that one isn't bloating out memory with data that won't be needed again. Throw in a seek() and the amount of read ahead caching may be reduced. One point that is being missed in all this discussion is that the file system also imposes some constraints on how IOs can be done. For example, simply doing a write(fd, buf, 1) doesn't emit a stream of sequential blocks to the drives. Some file systems (UFS was one) would force portions of large files into other cylinder groups so that small files could be located near the inode data, thus avoiding/reducing the size of seeks. Similarly, extents need to be allocated and the bitmaps recording this data usually need synchronous updates, which will require some seeks, etc. Not to mention the need to update inode data, etc. Anyway, my point is that the allocation policies of the file system can confuse the situation. Also, the seek times one sees reported are an average. One really needs to look at the track-to-track seek time and also the full-stroke seek times. It takes a *long* time to move the heads across the whole platter. I've seen people partition drives to only use small regions of the drives to avoid long seeks and to better use the increased number of bits going under the head in one rotation. A 15K drive doesn't necessarily have a faster seek time than a 10K drive just because the rotational speed is higher. The average seek time might be faster simply because the 15K drives are physically smaller, with fewer cylinders. -- Alan ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
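For reference, the aio approach mentioned above looks roughly like this with POSIX AIO (a minimal sketch: error handling trimmed, "datafile" is a placeholder, and on Linux you link with -lrt):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NREQ  8
#define BLKSZ 8192

int main(void)
{
    int fd = open("datafile", O_RDONLY);
    static char bufs[NREQ][BLKSZ];
    struct aiocb cbs[NREQ];
    const struct aiocb *list[NREQ];

    /* Load the kernel up with every read we know we'll need, so it
     * can sort, cluster, and overlap them as it sees fit. */
    memset(cbs, 0, sizeof cbs);
    for (int i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = BLKSZ;
        cbs[i].aio_offset = (off_t)i * BLKSZ;
        aio_read(&cbs[i]);
        list[i] = &cbs[i];
    }

    /* Reap completions in whatever order they finish. */
    for (int done = 0; done < NREQ; ) {
        aio_suspend(list, NREQ, NULL);   /* block until one completes */
        for (int i = 0; i < NREQ; i++) {
            if (list[i] != NULL && aio_error(list[i]) != EINPROGRESS) {
                printf("request %d: %zd bytes\n", i, aio_return(&cbs[i]));
                list[i] = NULL;          /* don't poll it again */
                done++;
            }
        }
    }
    close(fd);
    return 0;
}

The point is the first loop: the kernel sees all eight requests at once and can reorder them, instead of guessing from a sequence of blocking read() calls.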
Re: [PERFORM] PostgreSQL on Solaris 8 and ufs
Brandon Metcalf wrote: We've recently moved our pgsql installation and DBs to a Solaris 8 machine with a striped and mirrored ufs filesystem that houses the DB data. We are now seeing terrible performance and the bottleneck is no doubt disk I/O. We've tried modifying some tunables related to ufs, but it doesn't seem to be helping. Is there anything we should be looking at that is specifically related to ufs filesystems on Solaris 8, or possibly something in general that would improve performance? Well, Solaris 8 is a bit old now, so I don't remember all the details. But, if memory serves, Solaris 8 still has some high-water and low-water tunables related to the amount of IO that can be outstanding to a single file. Try setting set ufs:ufs_WRITES=0 in /etc/system and rebooting, which basically says any amount of disk IO can be outstanding. There's a tunables doc on docs.sun.com that explains this option. Also, logging UFS might help with some of the metadata requirements of UFS as well. So, use mount -o logging or add the relevant entry in /etc/vfstab. Of course, the best thing is Solaris 9 or 10, which would be much better for this sort of thing. Hope this helps. -- Alan ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
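Spelled out, those two changes look something like this (the metadevice names are placeholders; verify the tunable against the Solaris Tunable Parameters Reference Manual for your release):

# /etc/system -- remove the per-file outstanding-write throttle
set ufs:ufs_WRITES=0

# /etc/vfstab -- enable UFS logging on the data filesystem
#device          device            mount    FS   fsck  mount    mount
#to mount        to fsck           point    type pass  at boot  options
/dev/md/dsk/d10  /dev/md/rdsk/d10  /pgdata  ufs  2     yes      logging

The /etc/system change takes effect at the next reboot; the logging option can usually be applied to a mounted filesystem with mount -o remount,logging as well.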
Re: [PERFORM] multi billion row tables: possible or insane?
Ramon Bastiaans wrote: I am doing research for a project of mine where I need to store several billion values for a monitoring and historical tracking system for a big computer system. My current estimate is that I have to store (somehow) around 1 billion values each month (possibly more). I was wondering if anyone has had any experience with these kinds of big volumes of data in a postgres sql database and how this affects database design and optimization. What would be important issues when setting up a database this big, and is it at all doable? Or would it be insane to think about storing up to 5-10 billion rows in a postgres database? The database's performance is important. There would be no use in storing the data if a query will take ages. Queries should be quite fast if possible. I would really like to hear people's thoughts/suggestions or "go see a shrink, you must be mad" statements ;) It just dawned on me that we're doing something that, while not the same, might be relevant. One of our tables has ~85M rows in it according to the output from an explain select * from table. I don't plan on trying a select count(*) any time soon :) We add and remove about 25M rows a day to/from this table, which would be about 750M rows/month total. Given our current usage of the database, it could handle a larger row/day rate without too much trouble. (The problem isn't adding rows but deleting rows.)

    Column    |   Type   | Modifiers
--------------+----------+-----------
 timeseriesid | bigint   |
 bindata      | bytea    |
 binsize      | integer  |
 rateid       | smallint |
 ownerid      | smallint |
Indexes: idx_timeseries btree (timeseriesid)

In this case, each bytea entry is typically about 2KB of data, so the total table size is about 150GB, plus some index overhead. A second table has ~100M rows according to explain select *. Again it has about 30M rows added and removed per day.

   Column   |         Type          | Modifiers
------------+-----------------------+-----------
 uniqid     | bigint                |
 type       | character varying(50) |
 memberid   | bigint                |
 tag        | character varying(50) |
 membertype | character varying(50) |
 ownerid    | smallint              |
Indexes: composite_memberid btree (memberid)
         composite_uniqid btree (uniqid)

There are some additional tables that have a few million rows/day of activity, so call it 60M rows/day added and removed. We run a vacuum every day. The box is a dual Opteron 248 from Sun. Linux 2.6, 8GB of memory. We use reiserfs. We started with XFS but had several instances of file system corruption. Obviously, no RAID 5. The xlog is on a 2-drive mirror and the rest is on a separate mirrored volume. The drives are fiber channel, but that was a mistake as the driver from IBM wasn't very good. So, while we don't have a billion rows, we do have ~200M total rows in all the tables, and we're certainly running the daily row count that you'd need to obtain. But scaling this sort of thing up can be tricky and your mileage may vary. In a prior career I ran a data intensive computing center and helped do some design work for a high energy physics experiment: petabytes of data, big tape robots, etc., the usual Big Science toys. You might take a look at ROOT and some of the activity from those folks if you don't need transactions and all the features of a general database like postgresql. -- Alan ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Swapping on Solaris
Mark Kirkwood wrote: Kevin Schroeder wrote: Ignoring the fact that the sort and vacuum numbers are really high, this is what Solaris shows me when running top: Memory: 2048M real, 1376M free, 491M swap in use, 2955M swap free Maybe check the swap usage with 'swap -l', which reports reliably whether any (device or file) swap is actually used. I think Solaris 'top' does some strange accounting to calculate the 'swap in use' value (like including used memory). It looks to me like you are using no (device or file) swap at all, and have 1.3G of real memory free, so you could in fact give Postgres more of it :-) I suspect that the free memory is in fact being used for the file system cache. There were some changes in the meaning of 'free' in Solaris 8 and 9. The memstat command gives a nice picture of memory usage on the system. I don't think memstat came with Solaris 8, but you can get it from solarisinternals.com. The Solaris Internals book is an excellent read as well; it explains all of this in gory detail. Note that files in /tmp are usually in a tmpfs file system. These files may be the usage of swap that you're seeing (as they will be paged out on an active system with some memory pressure). Finally, just as everyone suggests upgrading to newer postgresql releases, you probably want to get to a newer Solaris release. -- Alan ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Swapping on Solaris
Kevin Schroeder wrote: I suspect that the memory is being used to cache files as well, since the email boxes are using unix mailboxes, for the time being. With people checking their email sometimes once per minute, I can see why Solaris would want to cache those files. Perhaps my question would be more appropriate to a Solaris mailing list, since what I really want to do is get Solaris to simply allow PostgreSQL to use more RAM and reduce the amount of RAM used for file caching. I would have thought that Solaris gives more deference to a running application that's being swapped than to a file cache. Is there any way to set custom parameters on Solaris' file-caching behavior to allow PostgreSQL to use more physical RAM? Your explanation doesn't sound quite correct. If postgresql malloc()'s some memory and uses it, the file cache will be reduced in size and the memory given to postgresql. But if postgresql doesn't ask for or use the memory, then solaris is going to use it for something else. Nothing in Solaris is preventing postgresql from using more RAM. -- Alan ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Swapping on Solaris
Kevin Schroeder wrote: I take that back. There actually is some paging going on. I ran sar -g 5 10 and when a request was made (totaling about 10 DB queries) my pgout/s jumped to 5.8 and my ppgout/s jumped to 121.8. pgfree/s also jumped to 121.80. I'm fairly sure that the pi and po numbers include file IO in Solaris, because of the unified VM and file system cache. -- Alan ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Swapping on Solaris
Kevin Schroeder wrote: I may be asking the question the wrong way, but when I start up PostgreSQL, swap is what gets used the most. I've got 1282MB free RAM right now and 515MB swap in use. Granted, swap file usage probably wouldn't be zero, but I would guess that it should be a lot lower, so something must be keeping PostgreSQL from using the free RAM that my system is reporting. For example, one of my postgres processes is 201M in size but only 72M is resident in RAM. That extra 130M is available in RAM, according to top, but postgres isn't using it. The test you're doing doesn't measure what you think you're measuring. First, what else is running on the machine? Note that some shared memory allocations do reserve backing pages in swap, even though the pages aren't currently in use. Perhaps this is what you're measuring? swap -s has better numbers than top. You'd do better to reboot, then start pgsql and see what memory is used. Just because you start a process and see the swap number increase doesn't mean that the new process is in swap. It means some anonymous pages had to be evicted to swap to make room for the new process, or some pages had to be reserved in swap for future use. Typically a new process won't be paged out unless something else is causing enormous memory pressure... -- Alan ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some
Tom Lane wrote: Kevin Brown [EMAIL PROTECTED] writes: Hmm... something just occurred to me about this. Would a hybrid approach be possible? That is, use mmap() to handle reads, and use write() to handle writes? Nope. Have you read the specs regarding mmap-vs-stdio synchronization? Basically it says that there are no guarantees whatsoever if you try this. The SUS text is a bit weaselly ("the application must ensure correct synchronization") but the HPUX mmap man page, among others, lays it on the line: It is also unspecified whether write references to a memory region mapped with MAP_SHARED are visible to processes reading the file and whether writes to a file are visible to processes that have mapped the modified portion of that file, except for the effect of msync(). It might work on particular OSes but I think depending on such behavior would be folly... We have some anecdotal experience along these lines: there was a set of kernel bugs in Solaris 2.6 or 7 related to this as well. We had several kernel panics and it took a bit to chase down, but the basic feedback was "oops, we're screwed." I've forgotten most of the details right now; the basic problem was a file being read+written via mmap and read()/write() at (essentially) the same time from the same pid. It would panic the system quite reliably. I believe the bugs related to this have been resolved in Solaris, but it was unpleasant to chase that problem down... -- Alan ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
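To see why the hybrid is risky, consider this sketch (illustrative only; per the SUS text quoted above, msync() is the only portable synchronization point between a MAP_SHARED mapping and read()/write() on the same file; error handling omitted, file name hypothetical):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_RDWR);
    char *map = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* "Write side": modify the page through the mapping... */
    memcpy(map, "new page image", 15);

    /* ...without this msync(), whether a concurrent read() of the
     * file sees the store is unspecified by the standard: */
    msync(map, 8192, MS_SYNC);

    /* Likewise, a write() through fd is not guaranteed to show up
     * in 'map' on every OS -- which is what sinks the mmap-for-reads,
     * write()-for-writes scheme. */
    munmap(map, 8192);
    close(fd);
    return 0;
}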
Re: [PERFORM] Excessive context switching on SMP Xeons
Bill Montgomery wrote: Alan Stange wrote: Here's a few numbers from the Opteron 250. If I get some time I'll post a more comprehensive comparison including some other systems. The system is a Sun v20z. Dual Opteron 250, 2.4 GHz, Linux 2.6, 8 GB memory. I did a compile and install of pg 8.0 beta 3. I created a data base on a tmpfs file system and ran pgbench. Everything was out of the box, meaning I did not tweak any config files. I used this for pgbench: $ pgbench -i -s 32 and this for pgbench invocations: $ pgbench -s 32 -c 1 -t 1 -v

clients  tps
      1  1290
      2  1780
      4  1760
      8  1680
     16  1376
     32   904

The same test on a Dell PowerEdge 1750, Dual Xeon 3.2 GHz, 512k cache, HT on, Linux 2.4.21-20.ELsmp (RHEL 3), 4GB memory, pg 7.4.5: $ pgbench -i -s 32 pgbench $ pgbench -s 32 -c 1 -t 1 -v

clients  tps   avg CS/sec
      1   601      48,000
      2   889      77,000
      4  1006      80,000
      8   985      59,000
     16   966      47,000
     32   913      46,000

Far less performance than the dual Opterons with a low number of clients, but the gap narrows as the number of clients goes up. Anyone smarter than me care to explain? Boy, did Thunderbird ever botch the format of the table I entered... I thought the falloff at 32 clients was a bit steep as well. One thought that crossed my mind is that pgbench -s 32 -c 32 ... might not be valid. From the pgbench README:

  -s scaling_factor
     this should be used with -i (initialize) option.
     number of tuples generated will be multiple of the
     scaling factor. For example, -s 100 will imply 10M
     (10,000,000) tuples in the accounts table.
     default is 1.
     NOTE: scaling factor should be at least as large as
     the largest number of clients you intend to test;
     else you'll mostly be measuring update contention.

Another possible cause is that the pgbench process is CPU starved and isn't able to keep driving the postgresql processes. So I ran pgbench from another system with all else the same. The numbers were a bit smaller but otherwise similar. I then reran everything using -s 64:

clients  tps
      1  1254
      2  1645
      4  1713
      8  1548
     16  1396
     32  1060

Still starting to head down a bit. In the 32 client case, the system was ~60% user time, ~25% system and ~15% idle. Anyway, the machine is clearly hitting some contention somewhere. It could be in the tmpfs code, VM system, etc. -- Alan ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Excessive context switching on SMP Xeons
Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: A few quick random observations on the Xeon v. Opteron comparison: - running a dual Xeon with hyperthreading turned on really isn't the same as having a quad cpu system. I haven't seen postgresql specific benchmarks, but the general case has been that HT is a benefit in a few particular workloads but of no benefit in general. Part of the FUD about hyperthreading did have a kernel of truth that lay in older kernels' schedulers. For example, with Linux until recently the kernel could easily end up scheduling two processes on the two virtual processors of one single physical processor, leaving the other physical processor totally idle. With modern kernels' schedulers I would expect hyperthreading to live up to its billing of adding 10% to 20% performance. I.e., a dual Xeon machine with hyperthreading won't be as fast as four processors, but it should be 10-20% faster than a dual Xeon without hyperthreading. As with all things, that will only help if you're bound by the right limited resource to begin with. If you're I/O bound it isn't going to help. I would expect Postgres, with its heavy demand on memory bandwidth and shared memory, could potentially benefit more than usual from being able to context switch during pipeline stalls. All true. I'd be surprised if HT on an older 2.8 GHz Xeon with only a 512K cache will see any real benefit. The dual Xeon is already memory starved; now further increase the memory pressure on the caches (because the 512K is now shared by two virtual processors) and you probably won't see a gain. It's memory stalls all around. To be clear, the context switch in this case isn't a kernel context switch but a virtual cpu context switch. The probable reason we see dual Opteron boxes way outperforming dual Xeon boxes is exactly because of Postgresql's heavy demand on memory. The Opterons have a much better memory system. A quick search on google or digging around in the comp.arch archives will provide lots of details. HP's web site has (had?) some benchmarks comparing these systems. HP sells both Xeon and Opteron systems, so the comparisons were quite fair. Their numbers showed the Opteron handily outperforming the Xeons. -- Alan ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Excessive context switching on SMP Xeons
Here's a few numbers from the Opteron 250. If I get some time I'll post a more comprehensive comparison including some other systems. The system is a Sun v20z. Dual Opteron 250, 2.4 GHz, Linux 2.6, 8 GB memory. I did a compile and install of pg 8.0 beta 3. I created a data base on a tmpfs file system and ran pgbench. Everything was out of the box, meaning I did not tweak any config files. I used this for pgbench: $ pgbench -i -s 32 and this for pgbench invocations: $ pgbench -s 32 -c 1 -t 1 -v

clients  tps
      1  1290
      2  1780
      4  1760
      8  1680
     16  1376
     32   904

How are these results useful? In some sense, this is a speed-of-light number for the Opteron 250. You'll never go faster on this system with a real storage subsystem involved instead of a tmpfs file system. It's also a set of numbers that anyone else can reproduce, as we don't have to deal with any differences in file systems, disk subsystems, networking, etc. Finally, it's a set of results that anyone else can produce on Xeons or other systems to make simple (and naive) comparisons. Just to stay on topic: vmstat reported about 30K cs/second while running the 1 and 2 client cases. -- Alan ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Excessive context switching on SMP Xeons
A few quick random observations on the Xeon v. Opteron comparison: - running a dual Xeon with hyperthreading turned on really isn't the same as having a quad cpu system. I haven't seen postgresql specific benchmarks, but the general case has been that HT is a benefit in a few particular workloads but of no benefit in general. - We're running postgresql 8 (in production!) on a dual Opteron 250, Linux 2.6, 8GB memory, 1.7TB of attached fiber channel disk, etc. This machine is fast. A dual 2.8 GHz Xeon with 512K caches (with or without HT enabled) simply won't be in the same performance league as this dual Opteron system (assuming identical disk systems, etc). We run a Linux 2.6 kernel because it scales under load so much better than the 2.4 kernels. The units we're using (and we have a lot of them) are SunFire v20z. You can get a dualie Opteron 250 for $7K with 4GB memory from Sun. My personal experience with this setup in a mission critical config is to not depend on 4-hour spare parts, but to spend the money and install the spare in the rack. Naturally, one can go cheaper with slower cpus, different vendors, etc. I don't care to go into the whole debate of Xeon v. Opteron here. We also have a lot of dual Xeon systems. In every comparison I've done with our codes, the dual Opteron clearly outperforms the dual Xeon, when running on one and both cpus. -- Alan Josh Berkus wrote: Bill, I'd be thrilled to test it too, if for no other reason than to determine whether what I'm experiencing really is the CS problem. Hmmm ... Gavin's patch is built against 8.0, and any version of the patch would require linux 2.6, probably 2.6.7 minimum. Can you test on that linux version? Do you have the resources to back-port Gavin's patch? Fair enough. I never see nearly this much context switching on my dual Xeon boxes running dozens (sometimes hundreds) of concurrent apache processes, but I'll concede this could just be due to the more parallel nature of a bunch of independent apache workers. Certainly could be. Heavy CSes only happen when you have a number of long-running processes with contention for RAM, in my experience. If Apache is dispatching things quickly enough, they'd never arise. Hence my desire for recommendations on alternate architectures ;-) Well, you could certainly stay on Xeon if there's better support availability. Just get off Dell *650's. Being a 24x7x365 shop, and these servers being mission critical, I require vendors that can offer 24x7 4-hour part replacement, like Dell or IBM. I haven't seen 4-way 64-bit boxes meeting that requirement for less than $20,000, and that's for a very minimally configured box. A suitably configured pair will likely end up costing $50,000 or more. I would like to avoid an unexpected expense of that size, unless there's no other good alternative. That said, I'm all ears for a cheaper alternative that meets my support and performance requirements. No, you're going to pay through the nose for that support level. It's how things work. tps = 369.717832 (including connections establishing) tps = 370.852058 (excluding connections establishing) Doesn't seem too bad to me. Have anything to compare it to? What's in your postgresql.conf? --Josh ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Anyone familiar with Apple Xserve RAID
Doug McNaught wrote: Kevin Barnard [EMAIL PROTECTED] writes: Actually you both are right and wrong. The XRaid uses FibreChannel to communicate to the host machine(s). The RAID controller is a FibreChannel controller. After that there is a FibreChannel-to-UltraATA conversion for each drive, with a separate ATA bus for each drive. What I am curious about is whether this setup gets around the ATA fsync problems, where the drive reports the write complete before it is actually performed. Good point. (a) The FC-ATA unit hopefully has a battery-backed cache, which would make the whole thing more robust against power loss. (b) Since Apple is the vendor for the drive units, they can buy ATA drives that don't lie about cache flushes. Whether they do or not is definitely a question. ;) FYI: http://developer.apple.com/technotes/tn/pdf/tn1040.pdf is a tech note on write-cache flushing. A bit dated now, but perhaps some other tech note from Apple has more recent information. -- Alan ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] linux distro for better pg performance
Joseph Shraibman wrote: J. Andrew Rogers wrote: Do these features make a difference? Far more than you would imagine. On one postgres server I just upgraded, we went from a 3Ware 8x7200-RPM RAID-10 configuration to an LSI 320-2 SCSI 3x10k RAID-5, with 256M Is raid 5 much faster than raid 10? On a 4 disk array with 3 data disks and 1 parity disk, you have to write 4/3rds the original data, while on raid 10 you have to write 2 times the original data, so logically raid 5 should be faster. I think this comparison is a bit simplistic. For example, most raid5 setups have full stripes that are more than 8K (the typical IO size in postgresql), so one might have to read in portions of the stripe in order to compute the parity. The needed bits might be in some disk or controller cache; if they're not, then you lose. If one is able to perform full-stripe writes then the raid5 config should be faster for writes. Note also that the mirror has 2 copies of the data, so the read IOs would be divided across 2 (or more) spindles using round robin or a more advanced algorithm to reduce seek times. Of course, I might be completely wrong... -- Alan ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
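The IO accounting behind that goes roughly like this (a simplified model; a battery-backed controller cache hides much of the read-modify-write cost, which is exactly the caveat above):

#include <stdio.h>

/* Full-stripe RAID5 write with d data disks: d+1 disks written for
 * d disks' worth of data. */
static double raid5_full_stripe(int d) { return (d + 1.0) / d; }

int main(void)
{
    /* Partial-stripe update of one 8K block (read-modify-write):
     * read old data + read old parity + write data + write parity. */
    printf("RAID5 RMW of one 8K block: 4 IOs\n");

    /* Full-stripe writes amortize the parity (3 data + 1 parity). */
    printf("RAID5 full-stripe amplification (3+1): %.2f\n",
           raid5_full_stripe(3));       /* prints 1.33 */

    /* RAID10 always writes both mirrors, but reads from either. */
    printf("RAID10 write amplification: 2.00\n");
    return 0;
}

So the 4/3 figure only holds for full-stripe writes; an 8K update on a wider stripe degrades to the 4-IO read-modify-write path unless the cache already holds the needed blocks.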
[PERFORM] vacuum performance
Hello all, I have a question/observation about vacuum performance. I'm running Solaris 9, pg 7.4.1. The process in question is doing a vacuum:

bash-2.05$ /usr/ucb/ps auxww | grep 4885
fiasco  4885 19.1  3.7 605896 592920 ?  O 19:29:44 91:38 postgres: fiasco fiasco [local] VACUUM

I do a truss on the process and see the output below looping over and over. Note the constant opening and closing of the file 42064889.3. Why the open/close cycle as opposed to caching the file descriptor somewhere? If PG really does need to loop like this, it should be much faster to set the cwd and then open without the path in the file name. You're forcing the kernel to do a lot of work walking the path, checking for nfs mounts, symlinks, etc. Thanks! -- Alan

open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F6E000, SEEK_SET) = 0x18F6E000
write(47, \0\0\0 zA9A3D9E8\0\0\0 .., 8192) = 8192
close(47) = 0
read(29, \0\0\0 }ED WF1B0\0\0\0 $.., 8192) = 8192
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F78000, SEEK_SET) = 0x18F78000
write(47, \0\0\0 zA9AC 090\0\0\0 .., 8192) = 8192
close(47) = 0
llseek(43, 0x26202000, SEEK_SET) = 0x26202000
read(43, \0\0\084 EC9FC P\0\0\0 ).., 8192) = 8192
semop(52, 0xFFBFC5E0, 1) = 0
semop(52, 0xFFBFC640, 1) = 0
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F62000, SEEK_SET) = 0x18F62000
write(47, \0\0\0 zA9C2\bB8\0\0\0 .., 8192) = 8192
close(47) = 0
read(29, \0\0\0 }ED X1210\0\0\0 $.., 8192) = 8192
semop(52, 0xFFBFC5E0, 1) = 0
semop(52, 0xFFBFC640, 1) = 0
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18018000, SEEK_SET) = 0x18018000
write(47, \0\0\0 zA997ADB0\0\0\0 .., 8192) = 8192
close(47) = 0
llseek(43, 0x2620, SEEK_SET) = 0x2620
read(43, \0\0\084 EC4F5E8\0\0\0 ).., 8192) = 8192
semop(52, 0xFFBFC5E0, 1) = 0
semop(52, 0xFFBFC640, 1) = 0
llseek(13, 13918208, SEEK_SET) = 13918208
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x18F52000, SEEK_SET) = 0x18F52000
write(47, \0\0\0 zABE7 V10\0\0\0 .., 8192) = 8192
close(47) = 0
semop(46, 0xFFBFC5D0, 1) = 0
read(29, \0\0\0 }ED X 2 p\0\0\0 $.., 8192) = 8192
semop(52, 0xFFBFC5E0, 1) = 0
semop(52, 0xFFBFC640, 1) = 0
llseek(43, 0x270DA000, SEEK_SET) = 0x270DA000
write(43, \0\0\087A2E8 #B8\0\0\0 ).., 8192) = 8192
llseek(43, 0x261FE000, SEEK_SET) = 0x261FE000
read(43, \0\0\084 EC498\0\0\0\0 ).., 8192) = 8192
poll(0xFFBFC100, 0, 10) = 0
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47
llseek(47, 0x1804A000, SEEK_SET) = 0x1804A000
write(47, \0\0\0 zAA0F8DE0\0\0\0 .., 8192) = 8192
close(47) = 0
read(29, \0\0\0 }ED X RD0\0\0\0 $.., 8192) = 8192
semop(52, 0xFFBFC5E0, 1) = 0
semop(52, 0xFFBFC640, 1) = 0
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
write(13, D0 Z\001\0\0\0 )\0\0\087.., 8192) = 8192
open64(/export/nst1/fi/pg/data1/base/91488/42064889.3, O_RDWR) = 47

---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
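Both alternatives the post suggests are easy to sketch in C (a hypothetical illustration, not PostgreSQL's actual code; pwrite() also folds the llseek/write pair seen in the trace into one call):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Alternative 1: cache the descriptor instead of open/close per IO. */
static int seg_fd = -1;

int write_block(const char *path, off_t off, const void *buf, size_t len)
{
    if (seg_fd < 0)
        seg_fd = open(path, O_RDWR);      /* pay the path walk once */
    if (seg_fd < 0)
        return -1;
    return pwrite(seg_fd, buf, len, off) == (ssize_t)len ? 0 : -1;
}

/* Alternative 2: resolve the directory once, then open by short name
 * so the kernel never re-walks /export/nst1/fi/pg/data1/base/91488. */
int open_in_dir(int dir_fd, const char *name)
{
    return openat(dir_fd, name, O_RDWR);
}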
Re: [PERFORM] Insert Times
PC Drew wrote: I tested this out and saw no improvement: I'd still suspect some class loading and HotSpot compilation issues are polluting your numbers. Try using a PreparedStatement against another table first, in order to make sure that class's bytecode has been loaded. There are some command line options to the JVM to have it print out some status info when it is loading classes and compiling methods; you might want to turn on those options as well. -- Alan

EXPLAIN ANALYZE SELECT * FROM one;
 Seq Scan on one (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.50 rows=51 loops=1)
 Total runtime: 0.75 msec
EXPLAIN ANALYZE SELECT * FROM one;
 Seq Scan on one (cost=0.00..20.00 rows=1000 width=404) (actual time=0.06..0.50 rows=51 loops=1)
 Total runtime: 0.64 msec
EXPLAIN ANALYZE SELECT * FROM one;
 Seq Scan on one (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.40 rows=51 loops=1)
 Total runtime: 0.54 msec
EXPLAIN ANALYZE SELECT * FROM one;
 Seq Scan on one (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.41 rows=51 loops=1)
 Total runtime: 0.54 msec
EXPLAIN ANALYZE SELECT * FROM one;
 Seq Scan on one (cost=0.00..20.00 rows=1000 width=404) (actual time=0.04..0.41 rows=51 loops=1)
 Total runtime: 0.53 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
 Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.02 rows=1 loops=1)
 Total runtime: 0.85 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
 Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.02..0.02 rows=1 loops=1)
 Total runtime: 0.15 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
 Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.02..0.02 rows=1 loops=1)
 Total runtime: 0.14 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
 Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.02..0.02 rows=1 loops=1)
 Total runtime: 0.12 msec
EXPLAIN ANALYZE INSERT INTO one (id, msg) VALUES (1, 'blah');
 Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.01..0.02 rows=1 loops=1)
 Total runtime: 0.12 msec

-----Original Message-----
From: Leeuw van der, Tim [mailto:[EMAIL PROTECTED]]
Sent: Tue 1/27/2004 12:38 AM
To: PC Drew; [EMAIL PROTECTED]
Subject: RE: [PERFORM] Insert Times

Hi, My personal feeling on this is that the long time taken for the first query is for loading all sorts of libraries, JVM startup overhead, etc. What if you first do some SELECT (whatever) on a different table, to warm up the JVM and the database? regards, --Tim

---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
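For reference, the JVM options alluded to above are standard HotSpot flags (the class name is a placeholder; output format varies by JVM version):

java -verbose:class -XX:+PrintCompilation YourInsertBenchmark

-verbose:class prints each class as it is loaded, and -XX:+PrintCompilation logs each method as HotSpot compiles it, so you can check whether a slow first statement on the client side lines up with class loading and JIT compilation rather than database work.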