Re: [HACKERS] 8192 BLCKSZ ?

2000-12-04 Thread Jan Wieck

Don Baccus wrote:

 ...
 I expect TOAST to work even better).  Users will still be able to change to
 larger blocksizes (perhaps a wise thing to do if a large percentage of their
 data won't fit into a single PG block).   Users using the default will
 be able to store rows of *awesome* length, efficiently.

Depends...

Actually the toaster already jumps in if your tuples exceed
BLCKSZ/4, so with the default of 8K blocks it tries to keep
all tuples smaller than 2K. The reasons behind that are:

1.  An average tuple size of 8K means an average of 4K of
    unused space at the end of each block. Wasting space means
    wasting IO bandwidth.

2.  Since big items are unlikely to be search criteria, having
    to read them into memory for every check for a match on
    other columns is a waste as well. The more big items are
    moved out of the main tuple, the smaller the main table
    becomes, the more likely it is that the main tuples
    (holding the keys) are cached, and the cheaper a
    sequential scan becomes.

Of course, especially for 2. there is a break-even point. That
is when the extra fetches needed to send toast values to the
client cost more than was saved by keeping them out of the main
scan. A full-table SELECT * definitely costs more if TOAST is
involved. But who does an unqualified SELECT * from a multi-gig
table without problems anyway? Usually you pick a single row or
a few rows based on some other key attributes - don't you?

Let's take an example. You have a forum server that displays
one article plus the date and sender of all follow-ups. The
article bodies are usually big (1-10K). So you do a SELECT *
to fetch the article actually displayed, and another SELECT
sender, date_sent just to get the info for the follow-ups. If
we assume a uniform distribution of body sizes and an average
of 10 follow-ups, that'd mean we save 52K of IO and cache
usage for each article displayed.
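
Where that 52K comes from, as a back-of-the-envelope sketch (the 5.2K
average body size is an assumption, roughly the middle of the 1-10K
range above):

    #include <stdio.h>

    /* Back-of-the-envelope check of the 52K figure (assumed numbers). */
    int main(void)
    {
        double avg_body_kb = 5.2;  /* assumed average body size, about the
                                      middle of the 1-10K range             */
        int    followups   = 10;   /* average follow-ups per displayed article */

        /* Without TOAST, the follow-up query drags each full body along. */
        printf("IO and cache saved per page view: ~%.0fK\n",
               avg_body_kb * followups);
        return 0;
    }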


Jan

--

#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #





RE: [HACKERS] 8192 BLCKSZ ?

2000-12-03 Thread Andrew Snow



 The cost difference between 32K and 8K disk reads/writes is so small
 these days, compared with the overall cost of the disk operation itself,
 that you can barely measure it - well below 1%. Remember that seek times
 advertised on disks are an average.

It has been said how small the difference is - therefore, in my opinion, it
should remain at 8KB to maintain the best average performance across all
existing platforms.

I say it's best to let the OS and mass storage subsystem worry about read-ahead
caching and whether they actually read 8KB off the disk, or 32KB or 64KB,
when we ask for 8.


- Andrew





Re: [Fwd: Re: [HACKERS] 8192 BLCKSZ ?]

2000-11-29 Thread mlw

Kevin O'Gorman wrote:
 
 mlw wrote:
 
  Tom Samplonius wrote:
 
   On Tue, 28 Nov 2000, mlw wrote:
  
Tom Samplonius wrote:

 On Mon, 27 Nov 2000, mlw wrote:

  This is just a curiosity.
 
  Why is the default postgres block size 8192? These days, with caching
  file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
  even gigabytes. Surely, 8K is inefficient.

   I think it is a pretty wild assumption to say that 32k is more efficient
 than 8k.  Considering how blocks are used, 32k may be in fact quite a bit
 slower than 8k blocks.
   
I'm not so sure I agree. Perhaps I am off base here, but I did a bit of
OS profiling a while back when I was doing a DICOM server. I
experimented with block sizes and found that the best throughput on
Linux and Windows NT was at 32K. The graph I created showed a steady
increase in performance and a drop just after 32K, then steady from
there. In Windows NT it was more pronounced than it was in Linux, but
Linux still exhibited a similar trait.
  
  You are a bit off base here.  The typical access pattern is random IO,
    not sequential.  If you use a large block size in Postgres, Postgres
    will read and write more data than necessary.  Which is faster? 1000 x 8K
    IOs?  Or 1000 x 32K IOs?
 
  I can sort of see your point, but 8K vs 32K is not a linear
  relationship. The big hit is the disk I/O operation, more so than
  just the data size. It may be almost as efficient to write 32K as it
  is to write 8K. While I do not know the exact numbers, and it varies
  by OS and disk subsystem, I am sure that writing 32K is not even
  close to 4x more expensive than writing 8K. Think about seek times:
  writing anything to the disk is expensive regardless of the amount
  of data. Most disks today have many heads and are RLL encoded. It
  may only add 10us (approx. 1-2 sectors of a 64-sector-per-track
  drive spinning at 7200 rpm) to a disk operation which takes an order
  of magnitude longer positioning the heads.

  The overhead of an additional 24K is minute compared to the cost of
  a disk operation. So if any measurable benefit can come from having
  bigger buffers, i.e. having more data available per disk operation,
  it will probably be faster.
 
 This is only part of the story.  It applies best when you're going
 to use sequential scans, for instance, or otherwise use all the info
 in any block that you fetch.  However, when your blocks are 8x bigger,
 your number of blocks in the disk cache is 8x fewer.  If you're
 accessing random blocks, your hopes of finding the block in the
 cache are affected (probably not 8x, but there is an effect).
 
 So don't just blindly think that bigger blocks are better.  It
 ain't necessarily so.
 

First, the difference between 8K and 32K is a factor of 4, not 8.

The problem is you are looking at these numbers as if there is a linear
relationship between the 8 and the 32. You are thinking 8 is 1/4 the
size of 32, so it must be 1/4 the amount of work. This is not true at
all.

Many operating systems use a fixed memory block size allocation for
their disk cache. They do not allocate a new block for every disk
request; they maintain a pool of fixed-size buffer blocks. So if you
use fewer bytes than the OS block size, you waste the difference between
your block size and the block size of the OS cache entry.

I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm
pretty confident that NT does as well from my previous tests.

So, in effect, an 8K block may waste 3/4 of the memory in the disk
cache.

 
http://www.mohawksoft.com



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-29 Thread mlw

Matthew Kirkwood wrote:
 
 On Tue, 28 Nov 2000, Tom Lane wrote:
 
  Nathan Myers [EMAIL PROTECTED] writes:
   In the event of a power outage, the drive will stop writing in
   mid-sector.
 
  Really?  Any competent drive firmware designer would've made sure that
  can't happen.  The drive has to detect power loss well before it
  actually loses control of its actuators, because it's got to move the
  heads to the safe landing zone.  If it checks for power loss and
  starts that shutdown process between sector writes, never in the
  middle of one, voila: atomic writes.
 
 In principle, that is correct.  However, the SGI XFS people
 have apparently found otherwise -- what can happen is that
 the drive itself has enough power to complete a write, but
 that the disk/controller buffers lose power and so you end
 up writing a (perhaps partial) block of zeroes.

I have worked on a few systems that are intended to take a hard power
failure gracefully. It is a very hard thing to do, with a lot of
specialized circuitry.

While it is nice to think about, on a normal computer system one cannot
depend on the system shutting down gracefully on a hard power loss
without a smart UPS and a daemon to shut down the system.

Disk write sizes or whatever do not matter one bit. Unless the computer
can know it is about to lose power, it cannot halt its operations and
enter a safe mode.

The whole "pull the plug" mentality is silly. Unless the system hardware
is specifically designed to manage this and the proper software is in
place, it cannot be done, and any "compliance" you think you see is
simply luck.

Any computer that has important data should have a smart UPS and a
daemon to manage it. 

-- 
http://www.mohawksoft.com



Re: [Fwd: Re: [HACKERS] 8192 BLCKSZ ?]

2000-11-29 Thread mlw

Kevin O'Gorman wrote:
 
 mlw wrote:
  Many operating systems use a fixed memory block size allocation for
  their disk cache. They do not allocate a new block for every disk
  request; they maintain a pool of fixed-size buffer blocks. So if you
  use fewer bytes than the OS block size, you waste the difference between
  your block size and the block size of the OS cache entry.
 
  I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm
  pretty confident that NT does as well from my previous tests.
 
 I dunno about NT, but here's a quote from "Linux Kernel Internals"
 2nd Ed, page 92-93:
 .. The block size for any given device may be 512, 1024, 2048 or
 4096 bytes
 
 ... the buffer cache manages individual block buffers of
 varying size.  For this, every block is given a 'buffer_head' data
 structure. ...  The definition of the buffer head is in linux/fs.h
 
 ... the size of this area exactly matches the block size 'b_size'...
 
 The quote goes on to describe how the data structures are designed to
 be processor-cache-aware.
 

I double checked the kernel source, and you are right. I stand corrected
about the disk caching.

My assertion stands: the difference between reading 32K and 8K from a
disk is negligible, and the probability of the data being within a 4
times larger block is 4 times better, even though the probability of
having the correct block in memory is 4 times less. So I don't think it
is a numerically significant issue.


-- 
http://www.mohawksoft.com



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-28 Thread Nathan Myers

On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
 "Christopher Kings-Lynne" [EMAIL PROTECTED] writes:
  I don't believe it's a performance issue, I believe it's that writes to
  blocks greater than 8k cannot be guaranteed 'atomic' by the operating
  system.  Hence, 32k blocks would break the transactions system.
 
 As Nathan remarks nearby, it's hard to tell how big a write can be
 assumed atomic, unless you have considerable knowledge of your OS and
 hardware.  

Not to harp on the subject, but even if you _do_ know a great deal
about your OS and hardware, you _still_ can't assume any write is
atomic.

To give an idea of what is involved, consider that modern disk 
drives routinely re-order writes, by themselves.  You think you
have asked for a sequential write of 8K bytes, or 16 sectors,
but the disk might write the first and last sectors first, and 
then the middle sectors in random order.  A block of all zeroes
might not be written at all, but just noted in the track metadata.

Most disks have a "feature" that they report the write complete
as soon as it is in the RAM cache, rather than after the sectors
are on the disk.  (It's a "feature" because it makes their
benchmarks come out better.)  It can usually be turned off, but 
different vendors have different ways to do it.  Have you turned
it off on your production drives?

In the event of a power outage, the drive will stop writing in
mid-sector.  If you're lucky, that sector would have a bad checksum
if you tried to read it.  If the half-written sector happens to 
contain track metadata, you might have a bigger problem.  


The short summary is: for power outage or OS-crash recovery purposes,
there is no such thing as atomicity.  This is why backups and 
transaction logs are important.

"Invest in a UPS."  Use a reliable OS, and operate it in a way that
doesn't stress it.  Even a well-built OS will behave oddly when 
resources are badly stressed.  (That the oddities may be documented
doesn't really help much.)

For performance purposes, it may be more or less efficient to group 
writes into 4K, 8K, or 32K chunks.  That's not a matter of database 
atomicity, but of I/O optimization.  It can only confuse people to 
use "atomicity" in that context.

Nathan Myers
[EMAIL PROTECTED]




Re: [HACKERS] 8192 BLCKSZ ?

2000-11-28 Thread Nathan Myers

On Tue, Nov 28, 2000 at 04:24:34PM -0500, Tom Lane wrote:
 Nathan Myers [EMAIL PROTECTED] writes:
  In the event of a power outage, the drive will stop writing in
  mid-sector.
 
 Really?  Any competent drive firmware designer would've made sure that
 can't happen.  The drive has to detect power loss well before it
 actually loses control of its actuators, because it's got to move
 the heads to the safe landing zone.  If it checks for power loss and
 starts that shutdown process between sector writes, never in the middle
 of one, voila: atomic writes.

I used to think that way too, because that's how I would design a drive.
(Anyway that would still only give you 512-byte-atomic writes, which 
isn't enough.)

Talking to people who build them was a rude awakening.  They have
apparatus to yank the head off the drive and lock it away when the 
power starts to go down, and it will happily operate in mid-write.
(It's possible that some drives are made the way Tom describes, but 
evidently not the commodity stuff.)

The level of software-development competence, and of reliability 
engineering, that I've seen among disk drive firmware maintainers
distresses me whenever I think about it.  A disk drive is best
considered as a throwaway cache image of your real medium.

 Of course, there's still no guarantee if you get a hardware failure
 or sector write failure (recovery from the write failure might well
 take longer than the drive has got).  But guarding against a plain
 power-failure scenario is actually simpler than doing it the wrong
 way.

If only the disk-drive vendors (and buyers!) thought that way...

Nathan Myers
[EMAIL PROTECTED]




[Fwd: Re: [HACKERS] 8192 BLCKSZ ?]

2000-11-28 Thread mlw


Tom Samplonius wrote:

 On Tue, 28 Nov 2000, mlw wrote:

  Tom Samplonius wrote:
  
   On Mon, 27 Nov 2000, mlw wrote:
  
This is just a curiosity.
   
Why is the default postgres block size 8192? These days, with caching
file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
even gigabytes. Surely, 8K is inefficient.
  
 I think it is a pretty wild assumption to say that 32k is more efficient
   than 8k.  Considering how blocks are used, 32k may be in fact quite a bit
   slower than 8k blocks.
 
  I'm not so sure I agree. Perhaps I am off base here, but I did a bit of
  OS profiling a while back when I was doing a DICOM server. I
  experimented with block sizes and found that the best throughput on
  Linux and Windows NT was at 32K. The graph I created showed a steady
  increase in performance and a drop just after 32K, then steady from
  there. In Windows NT it was more pronounced than it was in Linux, but
  Linux still exhibited a similar trait.

   You are a bit off base here.  The typical access pattern is random IO,
 not sequential.  If you use a large block size in Postgres, Postgres
 will read and write more data than necessary.  Which is faster? 1000 x 8K
 IOs?  Or 1000 x 32K IOs?

I can sort of see your point, but 8K vs 32K is not a linear
relationship. The big hit is the disk I/O operation, more so than just
the data size. It may be almost as efficient to write 32K as it is to
write 8K. While I do not know the exact numbers, and it varies by OS
and disk subsystem, I am sure that writing 32K is not even close to 4x
more expensive than writing 8K. Think about seek times: writing
anything to the disk is expensive regardless of the amount of data.
Most disks today have many heads and are RLL encoded. It may only add
10us (approx. 1-2 sectors of a 64-sector-per-track drive spinning at
7200 rpm) to a disk operation which takes an order of magnitude longer
positioning the heads.

The overhead of an additional 24K is minute compared to the cost of a
disk operation. So if any measurable benefit can come from having
bigger buffers, i.e. having more data available per disk operation, it
will probably be faster.
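
For anyone wanting to repeat that kind of measurement, a minimal sketch
follows. It is not the DICOM test harness mentioned above, just an
illustrative POSIX program that times synchronous writes of the same
amount of data at a few block sizes; the file name and sizes are
arbitrary choices, and caching, readahead and the drive's write cache
all affect the result:

    /*
     * Minimal block-size throughput sketch (assumes a POSIX system).
     * Writes the same total amount of data with O_SYNC at several block
     * sizes and reports the elapsed time for each.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define TOTAL_BYTES (8 * 1024 * 1024)   /* 8 MB per run */

    static double run(const char *path, size_t blksz)
    {
        char *buf = malloc(blksz);
        if (buf == NULL) { perror("malloc"); exit(1); }
        memset(buf, 'x', blksz);

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
        if (fd < 0) { perror("open"); exit(1); }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (size_t done = 0; done < TOTAL_BYTES; done += blksz)
            if (write(fd, buf, blksz) != (ssize_t) blksz)
                { perror("write"); exit(1); }
        gettimeofday(&t1, NULL);

        close(fd);
        free(buf);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }

    int main(void)
    {
        size_t sizes[] = { 4096, 8192, 16384, 32768 };
        for (int i = 0; i < 4; i++)
            printf("%6zu-byte writes: %.3f s\n",
                   sizes[i], run("blktest.dat", sizes[i]));
        unlink("blktest.dat");
        return 0;
    }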



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Mitch Vincent

I've been using a 32k BLCKSZ for months now without any trouble, though I've
not benchmarked it to see if it's any faster than one with a BLCKSZ of 8k..

-Mitch

 This is just a curiosity.

 Why is the default postgres block size 8192? These days, with caching
 file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
 even gigabytes. Surely, 8K is inefficient.

 Has anyone done any tests to see if a default 32K block would provide a
 better overall performance? 8K seems so small, and 32K looks to be where
 most x86 operating systems seem to have a sweet spot.

 If someone has the answer off the top of their head, and I'm just being
 stupid, let me have it. However, I have needed to up the block size to
 32K for a text management system and have seen no  performance problems.
 (It has not been a scientific experiment, admittedly.)

 This isn't a rant, but my gut tells me that a 32k block size as default
 would be better, and that smaller deployments should adjust down as
 needed.





Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Nathan Myers


Nothing is guaranteed for anything larger than 512 bytes, and even 
then you have maybe 1e-13 likelihood of a badly-written block written 
during a power outage going unnoticed.  (That is why the FAQ recommends
you invest in a UPS.)  If PG crashes, you're covered, regardless of 
block size.  If the OS crashes, you're not.  If the power goes out, 
you're not.

The block size affects how much is written when you change only a 
single record within a block.  When you update a two-byte field in a 
100-byte record, do you want to write 32k?  (The answer is "maybe".)

Nathan Myers
[EMAIL PROTECTED]

On Tue, Nov 28, 2000 at 09:14:15AM +0800, Christopher Kings-Lynne wrote:
 I don't believe it's a performance issue, I believe it's that writes to
 blocks greater than 8k cannot be guaranteed 'atomic' by the operating
 system.  Hence, 32k blocks would break the transactions system.  (Or
 something like that - am I correct?)
 
  From: [EMAIL PROTECTED] On Behalf Of Mitch Vincent
  Sent: Tuesday, November 28, 2000 8:40 AM
  Subject: Re: [HACKERS] 8192 BLCKSZ ?
 
  I've been using a 32k BLCKSZ for months now without any trouble,
  though I've
  not benchmarked it to see if it's any faster than one with a
  BLCKSZ of 8k..
 
   This is just a curiosity.
  
   Why is the default postgres block size 8192? These days, with caching
   file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
   even gigabytes. Surely, 8K is inefficient.
  
   Has anyone done any tests to see if a default 32K block would provide a
   better overall performance? 8K seems so small, and 32K looks to be where
   most x86 operating systems seem to have a sweet spot.
  
   If someone has the answer off the top of their head, and I'm just being
   stupid, let me have it. However, I have needed to up the block size to
   32K for a text management system and have seen no  performance problems.
   (It has not been a scientific experiment, admittedly.)
  
   This isn't a rant, but my gut tells me that a 32k block size as default
   would be better, and that smaller deployments should adjust down as
   needed.



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Don Baccus

At 08:39 PM 11/27/00 -0500, Bruce Momjian wrote:
 If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the
 box this database is running on gets hit pretty hard and I haven't had a
 single ounce of trouble since I went to 7.0.X

Larger block sizes mean larger blocks in the cache, therefore fewer
blocks per megabyte.  The more granular the cache, the better.

Well, true, but when you have 256 MB or a half-gig or more to devote to
the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
pain for a lot of folks.

Though the entire discussion's moot with PG 7.1 and its removal of the
tuple-size limit, it has been unfortunate that the fact that a block size
of up to 32KB can easily be configured at build time hasn't been printed
in a flaming-red oversized font on the front page of www.postgresql.org.
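
A sketch of what that build-time change looks like (illustrative only:
the exact file holding the define has moved between releases, and a full
rebuild plus a fresh initdb are required because the on-disk page layout
changes):

    /*
     * Illustrative only -- in the 7.x sources the page size is a
     * compile-time constant (BLCKSZ) in src/include/config.h, generated
     * from config.h.in; edit, rebuild, and re-run initdb.
     */
    #define BLCKSZ 32768        /* must be a power of 2; 32768 is the maximum */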

THE ENTIRE WORLD seems to believe that PG suffers from a hard-wired 8KB
limit on tuple size, rather than simply defaulting to that limit.  When
I tell the heathens that the REAL limit is 32KB, they're surprised, amazed,
pleased etc.

This default has unfairly contributed to the poor reputation PG has suffered
for so long, due to widespread ignorance of the fact that it's only a default,
easily changed.

For instance, the November Linux Journal has a column on PG, favorable, but
it mentions the 8KB limit as though it were absolute.  Tim Perdue's article on
PHP Builder implied the same when he spoke of PG 7.1 removing the limit.

Again, PG 7.1 removes the issue entirely, but it is ironic that so many
people had heard that PG suffered from a hard-wired 8KB limit on tuple
length...



- Don Baccus, Portland OR [EMAIL PROTECTED]
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Bruce Momjian

 At 08:39 PM 11/27/00 -0500, Bruce Momjian wrote:
  If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the
  box this database is running on gets hit pretty hard and I haven't had a
  single ounce of trouble since I went to 7.0.X
 
 Larger block sizes mean larger blocks in the cache, therefore fewer
 blocks per megabyte.  The more granular the cache, the better.
 
 Well, true, but when you have 256 MB or a half-gig or more to devote to
 the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
 pain for a lot of folks.

Agreed.  The other problem is that most people have 2-4MB of cache, so a
32k default would be too big for them.
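
The arithmetic behind that concern, taking the low end of that range as
an assumed cache size:

    #include <stdio.h>

    /* Buffer-count arithmetic for a fixed-size cache (assumed 2MB). */
    int main(void)
    {
        long cache = 2L * 1024 * 1024;
        printf("8K pages:  %ld buffers\n", cache / 8192);    /* 256 */
        printf("32K pages: %ld buffers\n", cache / 32768);   /*  64 */
        return 0;
    }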

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Don Baccus

At 09:30 PM 11/27/00 -0500, Bruce Momjian wrote:

 Well, true, but when you have 256 MB or a half-gig or more to devote to
 the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
 pain for a lot of folks.

Agreed.  The other problem is that most people have 2-4MB of cache, so a
32k default would be too big for them.

I've always been fine with the default, and in fact agree with it.  The
OpenACS project recommends a 16KB default for PG 7.0, but that's only so
we can hold reasonable-sized lzText strings in forum tables, etc.

I was only lamenting the fact that the world seems to have the impression
that it's not a default, but rather a hard-wired limit.



- Don Baccus, Portland OR [EMAIL PROTECTED]
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Tom Lane

"Christopher Kings-Lynne" [EMAIL PROTECTED] writes:
 I don't believe it's a performance issue, I believe it's that writes to
 blocks greater than 8k cannot be guaranteed 'atomic' by the operating
 system.  Hence, 32k blocks would break the transactions system.

As Nathan remarks nearby, it's hard to tell how big a write can be
assumed atomic, unless you have considerable knowledge of your OS and
hardware.  However, on traditional Unix filesystems (BSD-derived) it's
a pretty certain bet that writes larger than 8K will *not* be atomic,
since 8K is the filesystem block size.  You don't even need any crash
scenario to see why not: just consider running your disk down to zero
free space.  If there's one block left when you try to add a
multi-block page to your table, you are left with a corrupted page,
not an unwritten page.

Not sure about the wild-and-wooly world of Linux filesystems...
anybody know what the allocation unit is on the popular Linux FSes?
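
One way to answer that empirically is to ask the filesystem itself; here
is a small sketch using the POSIX statvfs() call (older systems may only
offer statfs()), with the mount point of interest passed as an argument:

    /*
     * Report the filesystem block size for a given path via statvfs().
     * ext2 of this era typically reports 1K, 2K, or 4K blocks.
     */
    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(int argc, char **argv)
    {
        struct statvfs sv;
        const char *path = (argc > 1) ? argv[1] : "/";

        if (statvfs(path, &sv) != 0) {
            perror("statvfs");
            return 1;
        }
        printf("%s: block size %lu, fragment size %lu\n",
               path, (unsigned long) sv.f_bsize, (unsigned long) sv.f_frsize);
        return 0;
    }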

My feeling is that 8K is an entirely reasonable size now that we have
TOAST, and so there's no longer much interest in changing the default
value of BLCKSZ.

In theory, I think, WAL should reduce the importance of page writes
being atomic --- but it still seems like a good idea to ensure that
they are as atomic as we can make them.

regards, tom lane



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Bruce Momjian

 At 09:30 PM 11/27/00 -0500, Bruce Momjian wrote:
 
  Well, true, but when you have 256 MB or a half-gig or more to devote to
  the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
  pain for a lot of folks.
 
 Agreed.  The other problem is that most people have 2-4MB of cache, so a
 32k default would be too big for them.
 
 I've always been fine with the default, and in fact agree with it.  The
 OpenACS project recommends a 16KB default for PG 7.0, but that's only so
 we can hold reasonable-sized lzText strings in forum tables, etc.
 
 I was only lamenting the fact that the world seems to have the impression
 that it's not a default, but rather a hard-wired limit.

Agreed.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] 8192 BLCKSZ ?

2000-11-27 Thread Tom Samplonius


On Mon, 27 Nov 2000, mlw wrote:

 This is just a curiosity.
 
 Why is the default postgres block size 8192? These days, with caching
 file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
 even gigabytes. Surely, 8K is inefficient.

  I think it is a pretty wild assumption to say that 32k is more efficient
than 8k.  Considering how blocks are used, 32k may be in fact quite a bit
slower than 8k blocks.


Tom