Re: [HACKERS] 8192 BLCKSZ ?
Don Baccus wrote:
> ... I expect TOAST to work even better). Users will still be able to change to larger blocksizes (perhaps a wise thing to do if a large percentage of their data won't fit into a single PG block). Users using the default will be able to store rows of *awesome* length, efficiently.

Depends... Actually the toaster already jumps in if your tuples exceed BLCKSZ/4, so with the default of 8K blocks it tries to keep all tuples smaller than 2K. The reasons behind that are:

1. An average tuple size of 8K means an average of 4K unused space at the end of each block. Wasting space means wasting IO bandwidth.

2. Since big items are unlikely to be search criteria, needing to read them into memory for every check for a match on other columns is a waste again. So the more big items are moved out of the main tuple, the smaller the main table becomes, the more likely it is that the main tuples (holding the keys) are cached, and the cheaper a sequential scan becomes.

Of course, especially for 2. there is a break-even point. That is when the extra fetches to send toast values to the client cost more than was saved by not fetching them during the main scan. A full-table SELECT * definitely costs more if TOAST is involved. But who does an unqualified SELECT * from a multi-gig table without problems anyway? Usually you pick a single row or a few based on some other key attributes - don't you?

Let's make an example. You have a forum server that displays one article plus the date and sender of all follow-ups. The article bodies are usually big (1-10K). So you do a SELECT * to fetch the actually displayed article, and another SELECT sender, date_sent just to get the info for the follow-ups. If we assume a uniform distribution of body size and an average of 10 follow-ups, that'd mean we save 52K of IO and cache usage for each article displayed.

Jan

--
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#===================================================== [EMAIL PROTECTED] #
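To put rough numbers on the example above: the following is only a back-of-the-envelope sketch, not the backend's TOAST code, and the arithmetic is an assumption about how a figure like the ~52K above comes about.

/*
 * Back-of-the-envelope check of the figures above.  This is NOT the
 * backend's actual TOAST code; the threshold and the forum numbers
 * are illustrative assumptions only.
 */
#include <stdio.h>

#define BLCKSZ 8192                    /* default page size under discussion */
#define TOAST_TARGET (BLCKSZ / 4)      /* toaster tries to shrink tuples below this */

int main(void)
{
    double avg_body = (1024.0 + 10240.0) / 2.0; /* uniform 1K-10K article bodies */
    int    followups = 10;                      /* assumed follow-ups per article */

    printf("toast target with 8K pages: %d bytes\n", TOAST_TARGET);
    printf("body bytes never fetched for the follow-up list: ~%.0fK\n",
           avg_body * followups / 1024.0);
    return 0;
}

That works out to roughly 55K of body data that the SELECT sender, date_sent query never has to touch - the same ballpark as the 52K quoted above.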
RE: [HACKERS] 8192 BLCKSZ ?
The cost difference between 32K and 8K disk reads/writes is so small these days, compared with the overall cost of the disk operation itself, that you can barely even measure it - well below 1%. Remember, seek times advertised on disks are an average.

It has already been said how small the difference is - therefore, in my opinion, it should remain at 8KB to maintain the best average performance on all existing platforms. I say it's best to let the OS and mass storage subsystem worry about read-ahead caching and whether they actually read 8KB off the disk, or 32KB or 64KB, when we ask for 8.

- Andrew
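Andrew's claim is easy to sanity-check directly. Below is a minimal sketch of such a measurement ("testfile" and the request count are placeholders, and the OS page cache will hide most of the difference unless the test file is much larger than RAM):

/*
 * Rough timing of 1000 random reads at 8K and at 32K request sizes.
 * Sketch only: "testfile" is a placeholder, and no attempt is made
 * to defeat the OS cache or the drive's read-ahead.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static double time_reads(int fd, off_t filesize, size_t reqsz, int n)
{
    char *buf = malloc(reqsz);
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < n; i++)
    {
        off_t off = ((off_t) (rand() % (filesize / reqsz))) * reqsz;

        if (pread(fd, buf, reqsz, off) < 0)
            perror("pread");
    }
    gettimeofday(&t1, NULL);
    free(buf);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    int   fd = open("testfile", O_RDONLY);   /* placeholder: any large file */
    off_t size;

    if (fd < 0) { perror("open"); return 1; }
    size = lseek(fd, 0, SEEK_END);

    printf("1000 x 8K : %.3f s\n", time_reads(fd, size, 8192, 1000));
    printf("1000 x 32K: %.3f s\n", time_reads(fd, size, 32768, 1000));
    close(fd);
    return 0;
}

On a file that actually forces seeks the two numbers come out close together, because the seek dominates; on a fully cached file the test mostly measures memory copies.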
Re: [Fwd: Re: [HACKERS] 8192 BLCKSZ ?]
Kevin O'Gorman wrote:
> mlw wrote:
> > Tom Samplonius wrote:
> > > On Tue, 28 Nov 2000, mlw wrote:
> > > > Tom Samplonius wrote:
> > > > > On Mon, 27 Nov 2000, mlw wrote:
> > > > > > This is just a curiosity. Why is the default postgres block size 8192? These days, with caching file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe even gigabytes, surely 8K is inefficient.
> > > > > I think it is a pretty wild assumption to say that 32k is more efficient than 8k. Considering how blocks are used, 32k may in fact be quite a bit slower than 8k blocks.
> > > > I'm not so sure I agree. Perhaps I am off base here, but I did a bit of OS profiling a while back when I was doing a DICOM server. I experimented with block sizes and found that the best throughput on Linux and Windows NT was at 32K. The graph I created showed a steady increase in performance and a drop just after 32K, then steady from there. In Windows NT it was more pronounced than in Linux, but Linux still exhibited a similar trait.
> > > You are a bit off base here. The typical access pattern is random IO, not sequential. If you use a large block size in Postgres, Postgres will read and write more data than necessary. Which is faster? 1000 x 8K IOs? Or 1000 x 32K IOs?
> > I can sort of see your point, but 8K vs 32K is not a linear relationship. The big hit is the disk I/O operation itself, more so than just the data size. It may be almost as efficient to write 32K as it is to write 8K. While I do not know the exact numbers, and it varies by OS and disk subsystem, I am sure that writing 32K is not even close to 4x more expensive than 8K. Think about seek times: writing anything to the disk is expensive regardless of the amount of data. Most disks today have many heads and are RLL encoded. It may only add 10us (approx. 1-2 sectors of a 64-sector drive spinning at 7200 rpm) to a disk operation which takes an order of magnitude longer positioning the heads. The overhead of an additional 24K is minute compared to the cost of a disk operation. So if any measurable benefit can come from having bigger buffers, i.e. having more data available per disk operation, it will probably be faster.
> This is only part of the story. It applies best when you're going to use sequential scans, for instance, or otherwise use all the info in any block that you fetch. However, when your blocks are 8x bigger, your number of blocks in the disk cache is 8x fewer. If you're accessing random blocks, your hopes of finding the block in the cache are affected (probably not 8x, but there is an effect). So don't just blindly think that bigger blocks are better. It ain't necessarily so.

First, the difference between 8K and 32K is 4, not 8. The problem is you are looking at these numbers as if there were a linear relationship between the 8 and the 32. You are thinking 8 is 1/4 the size of 32, so it must be 1/4 the amount of work. This is not true at all.

Many operating systems use a fixed block size for their disk cache. They do not allocate a new block for every disk request; they maintain a pool of fixed-size buffer blocks. So if you use fewer bytes than the OS block size, you waste the difference between your block size and the block size of the OS cache entry. I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm pretty confident that NT does as well from my previous tests. So, in effect, an 8K block may waste 3/4 of the memory in the disk cache.

--
http://www.mohawksoft.com
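The two effects being argued about here can be put side by side with a little arithmetic. In the sketch below the cache and table sizes are made-up numbers; the point is that under purely uniform random access the expected hit rate is just cached bytes over table bytes regardless of block size, so the real differences come from locality, per-request transfer cost, and wasted space - which is what the disagreement is actually about.

/*
 * Toy cache-hit arithmetic for the block-size argument above.
 * The 64 MB cache and 1 GB table are arbitrary assumptions.
 */
#include <stdio.h>

int main(void)
{
    double cache_bytes = 64.0 * 1024 * 1024;    /* assumed cache size */
    double table_bytes = 1024.0 * 1024 * 1024;  /* assumed table size */
    size_t sizes[] = { 8192, 32768 };

    for (int i = 0; i < 2; i++)
    {
        double slots  = cache_bytes / sizes[i]; /* blocks that fit in the cache */
        double blocks = table_bytes / sizes[i]; /* blocks in the table */

        /* uniform random block access: P(hit) ~ fraction of blocks cached */
        printf("%5zu-byte blocks: %8.0f cache slots, hit rate ~%.3f\n",
               sizes[i], slots, slots / blocks);
    }
    return 0;
}

Both lines print the same hit rate, which is exactly why the argument keeps coming back to access patterns and wasted space rather than raw slot counts.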
Re: [HACKERS] 8192 BLCKSZ ?
Matthew Kirkwood wrote:
> On Tue, 28 Nov 2000, Tom Lane wrote:
> > Nathan Myers [EMAIL PROTECTED] writes:
> > > In the event of a power outage, the drive will stop writing in mid-sector.
> > Really? Any competent drive firmware designer would've made sure that can't happen. The drive has to detect power loss well before it actually loses control of its actuators, because it's got to move the heads to the safe landing zone. If it checks for power loss and starts that shutdown process between sector writes, never in the middle of one, voila: atomic writes.
> In principle, that is correct. However, the SGI XFS people have apparently found otherwise -- what can happen is that the drive itself has enough power to complete a write, but the disk/controller buffers lose power and so you end up writing a (perhaps partial) block of zeroes.

I have worked on a few systems that intend to take a hard power failure gracefully. It is a very hard thing to do, with a lot of specialized circuitry. While it is nice to think about, on a normal computer system one cannot depend on the system shutting down gracefully on a hard power loss without a smart UPS and a daemon to shut down the system. It does not matter one bit what the disk write sizes are, or whatever. Unless the computer can know it is about to lose power, it cannot halt its operations and enter a safe mode.

The whole "pull the plug" mentality is silly. Unless the system hardware is specifically designed to manage this and proper software is in place, it cannot be done, and any "compliance" you think you see is simply luck. Any computer that has important data should have a smart UPS and a daemon to manage it.

--
http://www.mohawksoft.com
Re: [Fwd: Re: [HACKERS] 8192 BLCKSZ ?]
Kevin O'Gorman wrote:
> mlw wrote:
> > Many operating systems use a fixed block size for their disk cache. They do not allocate a new block for every disk request; they maintain a pool of fixed-size buffer blocks. So if you use fewer bytes than the OS block size, you waste the difference between your block size and the block size of the OS cache entry. I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm pretty confident that NT does as well from my previous tests.
> I dunno about NT, but here's a quote from "Linux Kernel Internals" 2nd Ed, pages 92-93: "... The block size for any given device may be 512, 1024, 2048 or 4096 bytes ... the buffer cache manages individual block buffers of varying size. For this, every block is given a 'buffer_head' data structure. ... The definition of the buffer head is in linux/fs.h ... the size of this area exactly matches the block size 'b_size' ..." The quote goes on to describe how the data structures are designed to be processor-cache-aware.

I double-checked the kernel source, and you are right. I stand corrected about the disk caching. My assertion stands, though: the difference between reading 32K and 8K from a disk is negligible, and the probability of the data you want being within a 4-times-larger block is 4 times better, even though the probability of having the correct block in memory is 4 times less. So I don't think it is a numerically significant issue.

--
http://www.mohawksoft.com
Re: [HACKERS] 8192 BLCKSZ ?
On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> "Christopher Kings-Lynne" [EMAIL PROTECTED] writes:
> > I don't believe it's a performance issue, I believe it's that writes to blocks greater than 8k cannot be guaranteed 'atomic' by the operating system. Hence, 32k blocks would break the transactions system.
> As Nathan remarks nearby, it's hard to tell how big a write can be assumed atomic, unless you have considerable knowledge of your OS and hardware.

Not to harp on the subject, but even if you _do_ know a great deal about your OS and hardware, you _still_ can't assume any write is atomic.

To give an idea of what is involved, consider that modern disk drives routinely re-order writes, by themselves. You think you have asked for a sequential write of 8K bytes, or 16 sectors, but the disk might write the first and last sectors first, and then the middle sectors in random order. A block of all zeroes might not be written at all, but just noted in the track metadata.

Most disks have a "feature" where they report the write complete as soon as it is in the RAM cache, rather than after the sectors are on the disk. (It's a "feature" because it makes their benchmarks come out better.) It can usually be turned off, but different vendors have different ways to do it. Have you turned it off on your production drives?

In the event of a power outage, the drive will stop writing in mid-sector. If you're lucky, that sector will have a bad checksum if you try to read it. If the half-written sector happens to contain track metadata, you might have a bigger problem.

The short summary is: for power-outage or OS-crash recovery purposes, there is no such thing as atomicity. This is why backups and transaction logs are important. "Invest in a UPS." Use a reliable OS, and operate it in a way that doesn't stress it. Even a well-built OS will behave oddly when resources are badly stressed. (That the oddities may be documented doesn't really help much.)

For performance purposes, it may be more or less efficient to group writes into 4K, 8K, or 32K chunks. That's not a matter of database atomicity, but of I/O optimization. It can only confuse people to use "atomicity" in that context.

Nathan Myers [EMAIL PROTECTED]
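A practical footnote to the write-cache "feature" described above: from an application, write() followed by fsync() only pushes the data as far as the kernel can see; whether the sectors are on the platter still depends on the drive's own cache setting. A minimal sketch ("datafile" is a placeholder, and this is not a guarantee on a drive whose write cache is enabled):

/*
 * Push one 8K page out of the OS as hard as portable calls allow.
 * Even after fsync() returns, a drive with its write cache enabled
 * may still be holding the sectors in RAM -- which is the point above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char page[8192];
    int  fd = open("datafile", O_WRONLY | O_CREAT, 0600);  /* placeholder path */

    if (fd < 0) { perror("open"); return 1; }
    memset(page, 0x5a, sizeof(page));

    if (write(fd, page, sizeof(page)) != (ssize_t) sizeof(page))
        perror("write");
    if (fsync(fd) != 0)     /* flushes kernel buffers; the drive cache is separate */
        perror("fsync");
    close(fd);
    return 0;
}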
Re: [HACKERS] 8192 BLCKSZ ?
On Tue, Nov 28, 2000 at 04:24:34PM -0500, Tom Lane wrote:
> Nathan Myers [EMAIL PROTECTED] writes:
> > In the event of a power outage, the drive will stop writing in mid-sector.
> Really? Any competent drive firmware designer would've made sure that can't happen. The drive has to detect power loss well before it actually loses control of its actuators, because it's got to move the heads to the safe landing zone. If it checks for power loss and starts that shutdown process between sector writes, never in the middle of one, voila: atomic writes.

I used to think that way too, because that's how I would design a drive. (Anyway, that would still only give you 512-byte-atomic writes, which isn't enough.) Talking to people who build them was a rude awakening. They have apparatus to yank the head off the drive and lock it away when the power starts to go down, and it will happily operate in mid-write. (It's possible that some drives are made the way Tom describes, but evidently not the commodity stuff.)

The level of software-development competence, and of reliability engineering, that I've seen among disk drive firmware maintainers distresses me whenever I think about it. A disk drive is best considered as a throwaway cache image of your real medium.

Of course, there's still no guarantee if you get a hardware failure or sector write failure (recovery from the write failure might well take longer than the drive has got). But guarding against a plain power-failure scenario is actually simpler than doing it the wrong way. If only the disk-drive vendors (and buyers!) thought that way...

Nathan Myers [EMAIL PROTECTED]
[Fwd: Re: [HACKERS] 8192 BLCKSZ ?]
Tom Samplonius wrote:
> On Tue, 28 Nov 2000, mlw wrote:
> > Tom Samplonius wrote:
> > > On Mon, 27 Nov 2000, mlw wrote:
> > > > This is just a curiosity. Why is the default postgres block size 8192? These days, with caching file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe even gigabytes, surely 8K is inefficient.
> > > I think it is a pretty wild assumption to say that 32k is more efficient than 8k. Considering how blocks are used, 32k may in fact be quite a bit slower than 8k blocks.
> > I'm not so sure I agree. Perhaps I am off base here, but I did a bit of OS profiling a while back when I was doing a DICOM server. I experimented with block sizes and found that the best throughput on Linux and Windows NT was at 32K. The graph I created showed a steady increase in performance and a drop just after 32K, then steady from there. In Windows NT it was more pronounced than in Linux, but Linux still exhibited a similar trait.
> You are a bit off base here. The typical access pattern is random IO, not sequential. If you use a large block size in Postgres, Postgres will read and write more data than necessary. Which is faster? 1000 x 8K IOs? Or 1000 x 32K IOs?

I can sort of see your point, but 8K vs 32K is not a linear relationship. The big hit is the disk I/O operation itself, more so than just the data size. It may be almost as efficient to write 32K as it is to write 8K. While I do not know the exact numbers, and it varies by OS and disk subsystem, I am sure that writing 32K is not even close to 4x more expensive than 8K. Think about seek times: writing anything to the disk is expensive regardless of the amount of data. Most disks today have many heads and are RLL encoded. It may only add 10us (approx. 1-2 sectors of a 64-sector drive spinning at 7200 rpm) to a disk operation which takes an order of magnitude longer positioning the heads. The overhead of an additional 24K is minute compared to the cost of a disk operation. So if any measurable benefit can come from having bigger buffers, i.e. having more data available per disk operation, it will probably be faster.
Re: [HACKERS] 8192 BLCKSZ ?
I've been using a 32k BLCKSZ for months now without any trouble, though I've not benchmarked it to see if it's any faster than one with a BLCKSZ of 8k..

-Mitch

> This is just a curiosity. Why is the default postgres block size 8192? These days, with caching file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe even gigabytes, surely 8K is inefficient. Has anyone done any tests to see if a default 32K block would provide better overall performance? 8K seems so small, and 32K looks to be where most x86 operating systems seem to have a sweet spot. If someone has the answer off the top of their head and I'm just being stupid, let me have it. However, I have needed to up the block size to 32K for a text management system and have seen no performance problems. (It has not been a scientific experiment, admittedly.) This isn't a rant, but my gut tells me that a 32K block size as default would be better, and that smaller deployments should adjust down as needed.
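The "sweet spot" claim quoted above is the kind of thing that is easy to re-measure. Here is a hedged sketch of such a sweep (the file name, the 64 MB total, and the list of sizes are assumptions; without the fsync() the numbers would mostly measure the OS write cache):

/*
 * Crude sequential-write throughput sweep over block sizes, in the
 * spirit of the profiling described in this thread.  Sketch only;
 * results depend heavily on the OS, filesystem, and drive.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define TOTAL (64 * 1024 * 1024)        /* bytes written per block size */

int main(void)
{
    size_t sizes[] = { 4096, 8192, 16384, 32768, 65536 };

    for (int i = 0; i < 5; i++)
    {
        char *buf = calloc(1, sizes[i]);
        int   fd = open("sweep.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        struct timeval t0, t1;
        double secs;

        if (fd < 0) { perror("open"); return 1; }
        gettimeofday(&t0, NULL);
        for (size_t done = 0; done < TOTAL; done += sizes[i])
            if (write(fd, buf, sizes[i]) < 0)
                perror("write");
        fsync(fd);                      /* include the flush, not just the copy to cache */
        gettimeofday(&t1, NULL);
        close(fd);
        free(buf);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%6zu-byte writes: %.1f MB/s\n",
               sizes[i], TOTAL / secs / (1024 * 1024));
    }
    unlink("sweep.tmp");
    return 0;
}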
Re: [HACKERS] 8192 BLCKSZ ?
Nothing is guaranteed for anything larger than 512 bytes, and even then you have maybe a 1e-13 likelihood of a badly-written block written during a power outage going unnoticed. (That is why the FAQ recommends you invest in a UPS.)

If PG crashes, you're covered, regardless of block size. If the OS crashes, you're not. If the power goes out, you're not.

The block size affects how much is written when you change only a single record within a block. When you update a two-byte field in a 100-byte record, do you want to write 32k? (The answer is "maybe".)

Nathan Myers [EMAIL PROTECTED]

On Tue, Nov 28, 2000 at 09:14:15AM +0800, Christopher Kings-Lynne wrote:
> I don't believe it's a performance issue, I believe it's that writes to blocks greater than 8k cannot be guaranteed 'atomic' by the operating system. Hence, 32k blocks would break the transactions system. (Or something like that - am I correct?)
>
> From: [EMAIL PROTECTED] On Behalf Of Mitch Vincent
> Sent: Tuesday, November 28, 2000 8:40 AM
> Subject: Re: [HACKERS] 8192 BLCKSZ ?
>
> I've been using a 32k BLCKSZ for months now without any trouble, though I've not benchmarked it to see if it's any faster than one with a BLCKSZ of 8k..
>
> > This is just a curiosity. Why is the default postgres block size 8192? These days, with caching file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe even gigabytes, surely 8K is inefficient. Has anyone done any tests to see if a default 32K block would provide better overall performance? 8K seems so small, and 32K looks to be where most x86 operating systems seem to have a sweet spot. If someone has the answer off the top of their head and I'm just being stupid, let me have it. However, I have needed to up the block size to 32K for a text management system and have seen no performance problems. (It has not been a scientific experiment, admittedly.) This isn't a rant, but my gut tells me that a 32K block size as default would be better, and that smaller deployments should adjust down as needed.
Re: [HACKERS] 8192 BLCKSZ ?
At 08:39 PM 11/27/00 -0500, Bruce Momjian wrote:
> > If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the box this database is running on gets hit pretty hard and I haven't had a single ounce of trouble since I went to 7.0.X
> Larger block sizes mean larger blocks in the cache, therefore fewer blocks per megabyte. The more granular the cache, the better.

Well, true, but when you have 256 MB or a half-gig or more to devote to the cache, you get plenty of blocks, and in pre-7.1 PG the 8KB limit is a pain for a lot of folks.

Though the entire discussion's moot with PG 7.1 and its removal of the tuple-size limit, it is unfortunate that the fact that a block size of up to 32KB can easily be configured at build time hasn't been printed in a flaming-red oversized font on the front page of www.postgresql.org. THE ENTIRE WORLD seems to believe that PG suffers from a hard-wired 8KB limit on tuple size, rather than simply defaulting to that limit. When I tell the heathens that the REAL limit is 32KB, they're surprised, amazed, pleased, etc.

This default has unfairly contributed to the poor reputation PG has suffered for so long, due to widespread ignorance that it's only a default, easily changed. For instance, the November Linux Journal has a column on PG, favorable but mentioning the 8KB limit as though it were absolute. Tim Perdue's article on PHP Builder implied the same when he spoke of PG 7.1 removing the limit.

Again, PG 7.1 removes the issue entirely, but it is ironic that so many people had heard that PG suffered from a hard-wired 8KB limit on tuple length...

- Don Baccus, Portland OR [EMAIL PROTECTED]
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
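For anyone reading this while still on a pre-7.1 release: the limit really is just a compile-time default. Treat the fragment below as an illustration rather than a patch - the header that defines BLCKSZ has moved around between versions - but the value cannot exceed 32768 because of how item pointers are laid out on a page, and is normally kept a power of two.

/*
 * Illustrative only: BLCKSZ is a compile-time constant in PostgreSQL's
 * configuration header (its exact location varies by version).
 * It must not exceed 32768.
 */
#define BLCKSZ 32768    /* instead of the default 8192 */

After changing it you have to rebuild the server and run initdb again, since the on-disk page layout depends on the value.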
Re: [HACKERS] 8192 BLCKSZ ?
> At 08:39 PM 11/27/00 -0500, Bruce Momjian wrote:
> > > If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the box this database is running on gets hit pretty hard and I haven't had a single ounce of trouble since I went to 7.0.X
> > Larger block sizes mean larger blocks in the cache, therefore fewer blocks per megabyte. The more granular the cache, the better.
> Well, true, but when you have 256 MB or a half-gig or more to devote to the cache, you get plenty of blocks, and in pre-7.1 PG the 8KB limit is a pain for a lot of folks.

Agreed. The other problem is that most people have 2-4MB of cache, so a 32k default would be too big for them.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Re: [HACKERS] 8192 BLCKSZ ?
At 09:30 PM 11/27/00 -0500, Bruce Momjian wrote:
> > Well, true, but when you have 256 MB or a half-gig or more to devote to the cache, you get plenty of blocks, and in pre-7.1 PG the 8KB limit is a pain for a lot of folks.
> Agreed. The other problem is that most people have 2-4MB of cache, so a 32k default would be too big for them.

I've always been fine with the default, and in fact agree with it. The OpenACS project recommends a 16KB default for PG 7.0, but that's only so we can hold reasonable-sized lzText strings in forum tables, etc. I was only lamenting the fact that the world seems to have the impression that it's not a default, but rather a hard-wired limit.

- Don Baccus, Portland OR [EMAIL PROTECTED]
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service and other goodies at http://donb.photo.net.
Re: [HACKERS] 8192 BLCKSZ ?
"Christopher Kings-Lynne" [EMAIL PROTECTED] writes: I don't believe it's a performance issue, I believe it's that writes to blocks greater than 8k cannot be guaranteed 'atomic' by the operating system. Hence, 32k blocks would break the transactions system. As Nathan remarks nearby, it's hard to tell how big a write can be assumed atomic, unless you have considerable knowledge of your OS and hardware. However, on traditional Unix filesystems (BSD-derived) it's a pretty certain bet that writes larger than 8K will *not* be atomic, since 8K is the filesystem block size. You don't even need any crash scenario to see why not: just consider running your disk down to zero free space. If there's one block left when you try to add a multi-block page to your table, you are left with a corrupted page, not an unwritten page. Not sure about the wild-and-wooly world of Linux filesystems... anybody know what the allocation unit is on the popular Linux FSes? My feeling is that 8K is an entirely reasonable size now that we have TOAST, and so there's no longer much interest in changing the default value of BLCKSZ. In theory, I think, WAL should reduce the importance of page writes being atomic --- but it still seems like a good idea to ensure that they are as atomic as we can make them. regards, tom lane
Re: [HACKERS] 8192 BLCKSZ ?
> At 09:30 PM 11/27/00 -0500, Bruce Momjian wrote:
> > > Well, true, but when you have 256 MB or a half-gig or more to devote to the cache, you get plenty of blocks, and in pre-7.1 PG the 8KB limit is a pain for a lot of folks.
> > Agreed. The other problem is that most people have 2-4MB of cache, so a 32k default would be too big for them.
> I've always been fine with the default, and in fact agree with it. The OpenACS project recommends a 16KB default for PG 7.0, but that's only so we can hold reasonable-sized lzText strings in forum tables, etc. I was only lamenting the fact that the world seems to have the impression that it's not a default, but rather a hard-wired limit.

Agreed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Re: [HACKERS] 8192 BLCKSZ ?
On Mon, 27 Nov 2000, mlw wrote:
> This is just a curiosity. Why is the default postgres block size 8192? These days, with caching file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe even gigabytes, surely 8K is inefficient.

I think it is a pretty wild assumption to say that 32k is more efficient than 8k. Considering how blocks are used, 32k may in fact be quite a bit slower than 8k blocks.

Tom