Re: [PERFORM] Practical upper limits of pgbench read/write tps with 8.3
On Mon, Jul 7, 2008 at 3:22 PM, Greg Smith <[EMAIL PROTECTED]> wrote:
> On Mon, 7 Jul 2008, Jeffrey Baker wrote:
>> On the single 2.2GHz Athlon, the maximum tps seems to be 1450...what's
>> the bottleneck? Is PG lock-bound?
>
> It can become lock-bound if you don't make the database scale
> significantly larger than the number of clients, but that's probably not
> your problem. The pgbench client driver program itself is pretty CPU
> intensive and can suffer badly from kernel issues. I am unsurprised you
> can only hit 1450 with a single CPU. On systems with multiple CPUs where
> the single CPU running the pgbench client is much faster than your 2.2GHz
> Athlon, you'd probably be able to get a few thousand TPS, but eventually
> the context switching of the client itself can become a bottleneck.

On a 2GHz Core 2 Duo the best tps achieved is 2300, with -c 8. pgbench itself gets around 10% of the CPU (user + sys for pgbench is 7s over a 35s wall-clock run on two cores, i.e. 7 of the 70 available CPU-seconds, thus 10%). I suppose you could still blame it on ctxsw between pgbench and pg itself, but the results are no better with pgbench on another machine cross-connected with gigabit ethernet.

-jwb

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
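[Editorial sketch, not part of the original thread: the CPU-share arithmetic above can be written out directly. The helper name is invented; the figures come from the message.]

```python
def client_cpu_share(cpu_s: float, wall_s: float, n_cpus: int) -> float:
    """Fraction of all available CPU time consumed by the pgbench client.

    cpu_s  -- user + sys seconds charged to the client process
    wall_s -- wall-clock duration of the run
    n_cpus -- cores in the box (wall_s * n_cpus CPU-seconds are available)
    """
    return cpu_s / (wall_s * n_cpus)

# 7 s of user+sys over a 35 s run on the 2-core Core 2 Duo:
share = client_cpu_share(cpu_s=7.0, wall_s=35.0, n_cpus=2)
print(f"{share:.0%}")  # -> 10%
```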
Re: [PERFORM] Practical upper limits of pgbench read/write tps with 8.3
On Mon, 7 Jul 2008, Jeffrey Baker wrote:
> On the single 2.2GHz Athlon, the maximum tps seems to be 1450...what's
> the bottleneck? Is PG lock-bound?

It can become lock-bound if you don't make the database scale significantly larger than the number of clients, but that's probably not your problem. The pgbench client driver program itself is pretty CPU intensive and can suffer badly from kernel issues. I am unsurprised you can only hit 1450 with a single CPU. On systems with multiple CPUs where the single CPU running the pgbench client is much faster than your 2.2GHz Athlon, you'd probably be able to get a few thousand TPS, but eventually the context switching of the client itself can become a bottleneck.

Running pgbench against a RAM disk is a good way to find out where the system bottlenecks without disk I/O involved; you might try that test on your larger server when you get a chance.

One interesting thing to watch that you may not have tried yet is running top and seeing how close to a single CPU pgbench itself is running. If you've got 4 CPUs, and the pgbench client program shows 25% utilization, it is now the bottleneck rather than whatever you thought you were measuring. I thought this might be the case in the last test results you reported on Friday but didn't have a chance to comment on it until now.

One thing you can try here is running pgbench itself on another server than the one hosting the database, but that seems to top out at a few thousand TPS as well; it may get higher than you've been seeing, though.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
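[Editorial sketch, not part of the original thread: Greg's rule of thumb, assuming top reports %CPU normalized across all cores, so a pegged single-threaded client shows about 100/n_cpus percent. The function name and the 90% slack factor are invented here.]

```python
def client_is_bottleneck(top_percent: float, n_cpus: int,
                         slack: float = 0.9) -> bool:
    """True if a single-threaded client is using roughly all of one core."""
    one_core = 100.0 / n_cpus  # a fully busy single thread, as % of total CPU
    return top_percent >= slack * one_core

print(client_is_bottleneck(25.0, 4))  # pgbench at 25% on a 4-CPU box -> True
print(client_is_bottleneck(10.0, 4))  # plenty of headroom left -> False
```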
[PERFORM] Practical upper limits of pgbench read/write tps with 8.3
I'm spending a third day testing with the ioDrive, and it occurred to me that I should normalize my tests by mounting the database on a ramdisk. The results were surprisingly low. On the single 2.2GHz Athlon, the maximum tps seems to be 1450. This is achieved with a single connection. I/O rates to and from the ramdisk never exceed 50MB/s on a one-minute average.

With the flash device on the same benchmark, the tps rate is 1350, meaning that as far as PostgreSQL is concerned, on this machine, the flash device achieves 90% of the best possible performance.

Question being: what's the bottleneck? Is PG lock-bound?

-jwb
Re: [PERFORM] Fusion-io ioDrive
> PFC, I have to say these kinds of posts make me a fan of yours. I've
> read many of your storage-related replies and have found them all very
> educational. I just want to let you know I found your assessment of the
> impact of Flash storage perfectly worded and unbelievably insightful.
> Thanks a million for sharing your knowledge with the list.
>
> -Dan

Hehe, thanks. There was a time when you had to be a big company full of cash to build a computer, and then suddenly people did it in garages, like Wozniak and Jobs, out of off-the-shelf parts. I feel the ioDrive guys are the same kind of hackers, except today's hackers have much more powerful tools. Perhaps, and I hope it's true, storage is about to undergo a revolution like the personal computer had 20-30 years ago, when the IBMs of the time were eaten from the roots up.

IMHO the key is that you can build an ioDrive from off-the-shelf parts, but you can't do that with a disk drive. Flash manufacturers are smelling blood; they profit from USB keys and digicams, but imagine the market for solid state drives! And in this case the hardware is simple: flash, RAM, an FPGA, some chips, nothing out of the ordinary. It is the brain juice in the software (which includes the FPGA programming) which will sort out the high performance and reliability winners from the rest. Lowering the barrier of entry is good for innovation.

I believe Linux will benefit, too, since the target is (for now) high-performance servers, and as shown by the ioDrive, innovating hackers prefer to write Linux drivers rather than Vista (argh) drivers.
Re: [PERFORM] Fusion-io ioDrive
On Mon, Jul 7, 2008 at 6:08 AM, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <[EMAIL PROTECTED]> wrote:
>>>           Service Time Percentile, millis
>>>          R/W TPS  R-O TPS  50th  80th  90th  95th
>>>  RAID        182      673    18    32    42    64
>>>  Fusion      971     4792     8     9    10    11
>>
>> Someone asked for bonnie++ output:
>>
>> Block output: 495MB/s, 81% CPU
>> Block input: 676MB/s, 93% CPU
>> Block rewrite: 262MB/s, 59% CPU
>>
>> Pretty respectable. In the same ballpark as an HP MSA70 + P800 with
>> 25 spindles.
>
> You left off the 'seeks' portion of the bonnie++ results -- this is
> actually the most important portion of the test. Based on your tps
> #s, I'm expecting seeks equivalent to about 10 10k drives configured in
> a raid 10, or around 1000-1500. They didn't publish any prices, so
> it's hard to say if this is 'cost competitive'.

I left it out because bonnie++ reports it as "+", i.e. greater than or equal to 10 per second.

-jwb
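[Editorial sketch, not part of the original thread: the back-of-the-envelope arithmetic behind Merlin's "10 x 10k drives in RAID 10" seek estimate. The ~125 random IOPS per 10k RPM spindle is a common rule of thumb, not a measured figure for any particular drive.]

```python
def raid10_read_iops(n_drives: int, iops_per_drive: float = 125.0) -> float:
    # Random reads can be serviced by every spindle in a RAID 10 array.
    return n_drives * iops_per_drive

def raid10_write_iops(n_drives: int, iops_per_drive: float = 125.0) -> float:
    # Each logical write costs two physical writes (both halves of a mirror).
    return n_drives * iops_per_drive / 2

print(raid10_read_iops(10))   # -> 1250.0, inside the 1000-1500 range cited
print(raid10_write_iops(10))  # -> 625.0
```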
Re: [PERFORM] Fusion-io ioDrive
On Mon, Jul 7, 2008 at 9:23 AM, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> I have a lot of problems with your statements. First of all, we are
> not really talking about 'RAM' storage...I think your comments would
> be more on point if we were talking about mounting database storage
> directly from the server memory, for example. Server memory and cpu are
> involved to the extent that the o/s uses them for caching and
> filesystem things and inside the device driver.

I'm not sure how those cards work, but my guess is that the CPU will go 100% busy (with a near-zero I/O wait) on any sizable workload. In this case, the current pgbench configuration being used is quite small and probably won't resemble this.

> Also, your comments seem to indicate that having a slower device leads
> to higher concurrency because it allows the process to yield and do
> other things. This is IMO simply false.

Argue all you want, but this is a fairly well-known (20+ year-old) behavior.

> With faster storage cpu loads will increase but only because the overall
> system throughput increases and cpu/memory 'work' increases in terms
> of overall system activity.

Again, I said that response times (and hence throughput) would improve. I'd like to see your argument explaining how you can handle more CPU-only operations when 0% of the CPU is free for use.

> Presumably as storage approaches the speed of main system memory
> the algorithms for dealing with it will become simpler (not having to
> go through acrobatics to try and make everything sequential)
> and thus faster.

We'll have to see.

> I also find the remarks of software 'optimizing' for strict hardware
> assumptions (L1+L2 cache) a little suspicious. In some old programs I
> remember keeping a giant C 'union' of critical structures that was
> exactly 8k to fit in the 486 cpu cache. In modern terms I think that
> type of programming (sans some specialized environments) is usually
> counter-productive...I think PostgreSQL's approach of deferring as
> much work as possible to the o/s is a great approach.

All of the major database vendors still see immense value in optimizing their algorithms and memory structures for specific platforms and CPU caches. Hence, if they're *paying* money for very specialized industry professionals to optimize in this way, I would hesitate to say there isn't any value in it. As a matter of fact, Postgres doesn't have those low-level resources, so for the most part, I have to agree that it has to rely on the OS.

-Jonah
Re: [PERFORM] Fusion-io ioDrive
> *) is the flash random write problem going to be solved in hardware or
> specialized solid state write caching techniques. At least currently,
> it seems like software is filling the role.

Those flash chips are page-based, not unlike a harddisk: you cannot erase and write a byte, you must erase and write a full page. The size of said page depends on the chip implementation. I don't know which chips they used, so I cannot comment there, but you can easily imagine that smaller pages yield faster random write throughput. For reads, you must first select a page and then access it. Thus, it is not like RAM at all. It is much more similar to a harddisk with an almost zero seek time (on reads) and a very small, but significant, seek time (on writes), because a page must be erased before being written.

Big flash chips include ECC inside to improve reliability. Basically the chips include a small static RAM buffer. When you want to read a page, it is first copied to SRAM and ECC-checked. When you want to write a page, you first write it to SRAM and then order the chip to write it to flash. Usually you can't erase a single page; you must erase a block which contains many pages (this is probably why most flash SSDs suck at random writes).

NAND flash will never replace SDRAM because of these restrictions (NOR flash acts like RAM, but it is slow and has less capacity). However, NAND flash is well suited to replacing harddisks.

When writing a page, you write it to the small static RAM buffer on the chip (fast) and tell the chip to write it to flash (slow). While the chip is busy erasing or writing, you cannot do anything with it, but you can still talk to the other chips. Since the ioDrive has many chips, I'd bet they use this feature.

I don't know about the ioDrive implementation, but you can see that the paging and erasing requirements mean some tricks have to be applied, and the thing will probably need some smart buffering in RAM in order to be fast. Since the data in flash doesn't need to be sequential (read seek time being close to zero), it is possible they use a system which makes all writes sequential (for instance), which would suit the block-erasing requirements very well, with the information about block mapping stored in RAM; or perhaps they use some form of copy-on-write. It would be interesting to dissect this algorithm, especially the part which allows the block mappings to be stored permanently, since they cannot be stored in a constant known sector, which would wear out pretty quickly.

Ergo, in order to benchmark this thing and get relevant results, I would tend to think that you'd need to fill it to, say, 80% of capacity and bombard it with small random writes, the total amount of data written being many times the total capacity of the drive, in order to test the remapping algorithms, which are the weak point of such a device.

> *) do the software solutions really work (unproven)
> *) when are the major hardware vendors going to get involved. they
> make a lot of money selling disks and supporting hardware (san, etc).

Looking at the pictures of the "drive" I see a bunch of flash chips, which probably make up the bulk of the cost, a switching power supply, a small BGA chip which is probably a DDR memory for buffering, and the mystery ASIC, which is probably an FPGA; I would tend to think Virtex4 from the shape of the package seen from the side in one of the pictures.

A team of talented engineers can design and produce such a board, and assembly would only use standard PCB processes. This is unlike harddisks, which need a huge investment and a specialized factory because of the complex mechanical parts and very tight tolerances. In the case of the ioDrive, most of the value is in the intellectual property: software on the PC CPU (driver), embedded software, and programming the FPGA. All this points to a very different economic model for storage.
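[Editorial sketch, not part of the original thread: a toy flash-translation layer illustrating the write-remapping scheme speculated about above. Logical writes are laid out physically sequentially, the logical-to-physical map lives in RAM, and reclaiming space erases whole blocks. All sizes and the grossly simplified compaction policy are invented for illustration; this is not the ioDrive's actual algorithm.]

```python
PAGES_PER_BLOCK = 4  # invented; real NAND blocks hold far more pages

class ToyFTL:
    def __init__(self, n_blocks: int):
        self.n_pages = n_blocks * PAGES_PER_BLOCK
        self.flash = [None] * self.n_pages  # physical pages
        self.map = {}                       # logical page -> physical page
        self.head = 0                       # next free physical page
        self.blocks_erased = 0

    def write(self, logical, data):
        # Never overwrite in place: append at the head, remap in RAM.
        if self.head == self.n_pages:
            self._compact()
        self.flash[self.head] = (logical, data)
        self.map[logical] = self.head
        self.head += 1

    def read(self, logical):
        return self.flash[self.map[logical]][1]

    def _compact(self):
        # Copy live pages out, erase everything, rewrite them sequentially.
        # A real FTL erases only the emptiest blocks and wear-levels.
        live = sorted((l, self.flash[p][1]) for l, p in self.map.items())
        self.blocks_erased += self.n_pages // PAGES_PER_BLOCK
        self.flash = [None] * self.n_pages
        self.map = {}
        self.head = 0
        for l, d in live:
            self.write(l, d)

ftl = ToyFTL(n_blocks=2)
ftl.write(0, "a")
ftl.write(0, "b")   # rewrite goes to a fresh physical page
print(ftl.read(0))  # -> b
print(ftl.map[0])   # -> 1 (the logical page moved; old page 0 is garbage)
```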
I could design and build a scaled-down version of the ioDrive in my "garage", for instance (well, the PCI Express licensing fees are hefty, so I'd use PCI, but you get the idea). This means I think we are about to see a flood of these devices coming from many small companies. This is very good for the end user, because there will be competition, natural selection, and fast evolution. Interesting times ahead!

> I'm not particularly enamored of having a storage device be stuck
> directly in a pci slot -- although I understand it's probably necessary
> in the short term as flash changes all the rules and you can't expect
> it to run well using mainstream hardware raid controllers. By using
> their own device they have complete control of the i/o stack up to the
> o/s driver level.

Well, SATA is great for harddisks: small cables, less clutter, less failure-prone than 80-conductor cables, faster, cheaper, etc... Basically serial LVDS (low vol
Re: [PERFORM] Fusion-io ioDrive
On Wed, Jul 2, 2008 at 7:41 AM, Jonah H. Harris <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 1, 2008 at 8:18 PM, Jeffrey Baker <[EMAIL PROTECTED]> wrote:
>> Basically the ioDrive is smoking the RAID. The only real problem with
>> this benchmark is that the machine became CPU-limited rather quickly.
>
> That's traditionally the problem with everything being in memory.
> Unless the database algorithms are designed to exploit L1/L2 cache and
> RAM, which is not the case for a disk-based DBMS, you generally lose
> some concurrency due to the additional CPU overhead of playing only
> with memory. This is generally acceptable if you're going to trade
> off higher concurrency for faster service times. And, it isn't only
> evidenced in single systems where a disk-based DBMS is 100% cached,
> but also in most shared-memory clustering architectures.
>
> In most cases, when you're waiting on disk I/O, you can generally
> support higher concurrency because the OS can utilize the CPU's free
> cycles (during the wait) to handle other users. In short, sometimes,
> disk I/O is a good thing; it just depends on what you need.

I have a lot of problems with your statements. First of all, we are not really talking about 'RAM' storage...I think your comments would be more on point if we were talking about mounting database storage directly from the server memory, for example. Server memory and cpu are involved to the extent that the o/s uses them for caching and filesystem things and inside the device driver.

Also, your comments seem to indicate that having a slower device leads to higher concurrency because it allows the process to yield and do other things. This is IMO simply false.

With faster storage, cpu loads will increase, but only because the overall system throughput increases and cpu/memory 'work' increases in terms of overall system activity. Presumably as storage approaches the speed of main system memory, the algorithms for dealing with it will become simpler (not having to go through acrobatics to try and make everything sequential) and thus faster.

I also find the remarks about software 'optimizing' for strict hardware assumptions (L1+L2 cache) a little suspicious. In some old programs I remember keeping a giant C 'union' of critical structures that was exactly 8k to fit in the 486 cpu cache. In modern terms I think that type of programming (sans some specialized environments) is usually counter-productive...I think PostgreSQL's approach of deferring as much work as possible to the o/s is a great approach.

merlin
Re: [PERFORM] Fusion-io ioDrive
On Sat, Jul 5, 2008 at 2:41 AM, Jeffrey Baker <[EMAIL PROTECTED]> wrote:
>>           Service Time Percentile, millis
>>          R/W TPS  R-O TPS  50th  80th  90th  95th
>>  RAID        182      673    18    32    42    64
>>  Fusion      971     4792     8     9    10    11
>
> Someone asked for bonnie++ output:
>
> Block output: 495MB/s, 81% CPU
> Block input: 676MB/s, 93% CPU
> Block rewrite: 262MB/s, 59% CPU
>
> Pretty respectable. In the same ballpark as an HP MSA70 + P800 with
> 25 spindles.

You left off the 'seeks' portion of the bonnie++ results -- this is actually the most important portion of the test. Based on your tps #s, I'm expecting seeks equivalent to about 10 10k drives configured in a raid 10, or around 1000-1500. They didn't publish any prices, so it's hard to say if this is 'cost competitive'.

These numbers are indeed fantastic, disruptive even. If I were testing the device for consideration in high-duty server environments, I would be doing durability testing right now...I would be slamming the database with transactions (fsync on, etc.) and then powering off the device. I would do this several times...making sure the software layer isn't doing some mojo that is technically cheating.

I'm not particularly enamored of having a storage device be stuck directly in a pci slot -- although I understand it's probably necessary in the short term as flash changes all the rules and you can't expect it to run well using mainstream hardware raid controllers. By using their own device they have complete control of the i/o stack up to the o/s driver level.

I've been thinking for a while now that flash is getting ready to explode into use in server environments. The outstanding questions I see are:

*) is the write endurance problem truly solved (giving at least a 5-10 year lifetime)
*) what are the true odds of catastrophic device failure (industry claims less; we'll see)
*) is the flash random write problem going to be solved in hardware or by specialized solid state write caching techniques. At least currently, it seems like software is filling the role.
*) do the software solutions really work (unproven)
*) when are the major hardware vendors going to get involved. they make a lot of money selling disks and supporting hardware (san, etc).

merlin
Re: [PERFORM] How much work_mem to configure...
In response to "Scott Marlowe" <[EMAIL PROTECTED]>:

> On Sat, Jul 5, 2008 at 5:24 AM, Jessica Richard <[EMAIL PROTECTED]> wrote:
>> How can I tell if my work_mem configuration is enough to support all
>> Postgres user activities on the server I am managing?
>>
>> Where do I find the indication if the number is lower than needed?
>
> You kinda have to do some math with fudge factors involved. As
> work_mem gets smaller, sorts spill over to disk and get slower, and
> hash_aggregate joins get avoided because they need to fit into memory.
>
> As you increase work_mem, sorts can start happening in memory (or with
> less disk writing) and larger and larger sets can have hash_agg joins
> performed on them because they can fit in memory.
>
> But there's a dark side to work_mem being too large, and that is that
> you can run your machine out of free memory with too many large sorts
> happening, and then the machine will slow to a crawl as it swaps out
> the very thing you're trying to do in memory.
>
> So, I tend to plan for about 1/4 of memory used for shared_buffers,
> and up to 1/4 used for sorts, so there's plenty of head room for the
> OS to cache files, which is also important for performance. If you plan
> on having 20 users accessing the database at once, then you figure
> each one might on average run a query with 2 sorts, and that you'll be
> using a maximum of 20*2*work_mem for those sorts etc...
>
> If it's set to 8M, then you'd get approximately 320 MB max used by
> all the sorts flying at the same time. You can see why high work_mem
> and high max_connections settings together can be dangerous, and why
> pooling connections to limit the possibility of such a thing is useful
> too.
>
> Generally it's a good idea to keep it in the 4 to 16 meg range on most
> machines to prevent serious issues, but if you're going to allow 100s
> of connections at once, then you need to look at limiting it based on
> how much memory your server has.
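[Editorial sketch, not part of the original thread: Scott's sizing rule written out. Worst-case sort memory is roughly connections x sorts-per-query x work_mem; the figures mirror the example in the quoted message (20 users, 2 sorts each, work_mem = 8MB).]

```python
def max_sort_memory_mb(connections: int, sorts_per_query: int,
                       work_mem_mb: float) -> float:
    """Rough upper bound on memory consumed by concurrent sorts."""
    return connections * sorts_per_query * work_mem_mb

print(max_sort_memory_mb(20, 2, 8))  # -> 320 (MB), matching the post
```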
I do have one thing to add: if you're using 8.3, there's a log_temp_files config variable that you can use to monitor when your sorts spill over onto disk. It doesn't change anything that Scott said; it simply gives you another way to monitor what's happening and thus have better information to tune by.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/
[EMAIL PROTECTED]
Phone: 412-422-3463x4023