Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-14 Thread Jeffrey W. Baker
On Fri, 2005-06-24 at 09:37 -0400, Tom Lane wrote:
 ITAGAKI Takahiro [EMAIL PROTECTED] writes:
  ... So I'll post the new results:
 
  checkpoint_ | writeback |           |                |               |               |
  segments    | cache     | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
  ------------+-----------+-----------+----------------+---------------+---------------+--------------
  [3]   3     | off       |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
 
 Yeah, this is about what I was afraid of: if you're actually fsyncing
 then you get at best one commit per disk revolution, and the negotiation
 with the OS is down in the noise.
 
 At this point I'm inclined to reject the patch on the grounds that it
 adds complexity and portability issues, without actually buying any
 useful performance improvement.  The write-cache-on numbers are not
 going to be interesting to any serious user :-(

You mean not interesting to people without a UPS.  Personally, I'd like
to realize a 50% boost in tps, which is what O_DIRECT buys according to
ITAGAKI Takahiro's posted results.

The batteries on a caching RAID controller can run for days at a
stretch.  It's not as dangerous as people make it sound.  And anyone
running PG on software RAID is crazy.

-jwb

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-14 Thread Jeffrey W. Baker
On Fri, 2005-06-24 at 10:19 -0500, Jim C. Nasby wrote:
 On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
  ITAGAKI Takahiro [EMAIL PROTECTED] writes:
   ... So I'll post the new results:
  
   checkpoint_ | writeback |           |                |               |               |
   segments    | cache     | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
   ------------+-----------+-----------+----------------+---------------+---------------+--------------
   [3]   3     | off       |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
  
  Yeah, this is about what I was afraid of: if you're actually fsyncing
  then you get at best one commit per disk revolution, and the negotiation
  with the OS is down in the noise.
  
  At this point I'm inclined to reject the patch on the grounds that it
  adds complexity and portability issues, without actually buying any
  useful performance improvement.  The write-cache-on numbers are not
  going to be interesting to any serious user :-(
 
 Is there anyone with a battery-backed RAID controller that could run
 these tests? I suspect that in that case the differences might be closer
 to 1 or 2 rather than 3, which would make the patch much more valuable.

I applied the O_DIRECT patch to 8.0.3 and I tested this on a
battery-backed RAID controller with 128MB of cache and 5 7200RPM SATA
disks.  All caches are write-back.  The xlog and data are on the same
JFS volume.  pgbench was run with a scale factor of 1000 and 10
total transactions.  Clients varied from 10 to 100.


Clients  |  fsync   |   open_direct
---------+----------+--------------
   10    |    81    |    98 (+21%)
  100    |   100    |   105 ( +5%)


No problems were experienced.  The patch seems to give a useful boost!

-jwb


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-14 Thread Greg Stark

Jeffrey W. Baker [EMAIL PROTECTED] writes:

 The batteries on a caching RAID controller can run for days at a
 stretch.  It's not as dangerous as people make it sound.  And anyone
 running PG on software RAID is crazy.

Get back to us after your first hardware failure when your vendor says the
power supply you need is on backorder and won't be available for 48 hours...

(And what's your problem with software raid anyways?)

-- 
greg



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-14 Thread Joshua D. Drake

Greg Stark wrote:

Jeffrey W. Baker [EMAIL PROTECTED] writes:



The batteries on a caching RAID controller can run for days at a
stretch.  It's not as dangerous as people make it sound.  And anyone
running PG on software RAID is crazy.



Get back to us after your first hardware failure when your vendor says the
power supply you need is on backorder and won't be available for 48 hours...

(And what's your problem with software raid anyways?)


I would have to second that. Software raid works just fine.

Sincerely,

Joshua D. Drake







--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-06 Thread Mark Wong
On Fri, 24 Jun 2005 09:21:56 -0700
Josh Berkus josh@agliodbs.com wrote:

 Jim,
 
  Josh, is this something that could be done in the performance lab?
 
 That's the idea.   Sadly, OSDL's hardware has been having critical failures of
 late (I'm still trying to get test results on the checkpointing thing) and
 the GreenPlum machines aren't up yet.

I'm on the verge of having a 4-way opteron system with 4 Adaptec 2200s
scsi controllers attached to eight 10-disk 36GB arrays ready.  I believe
there are software tools that'll let you reconfigure the luns from linux
so you wouldn't need physical access.  Anyone want time on the system?

Mark


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-07-02 Thread Bruce Momjian

These patches will require some refactoring and documentation, but I
will do that when I apply it.

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---


ITAGAKI Takahiro wrote:
 Tom Lane [EMAIL PROTECTED] wrote:
 
  Yeah, this is about what I was afraid of: if you're actually fsyncing
  then you get at best one commit per disk revolution, and the negotiation
  with the OS is down in the noise.
 
 If we disable writeback-cache and use open_sync, the per-page writing
 behavior in WAL module will show up as bad result. O_DIRECT is similar
 to O_DSYNC (at least on linux), so that the benefit of it will disappear
 behind the slow disk revolution.
 
 In the current source, WAL is written as:
 for (i = 0; i < N; i++) { write(buffers[i], BLCKSZ); }
 Is this intentional? Can we rewrite it as follows?
write(buffers[0], N * BLCKSZ);
 
 In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
 Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
 These two patches are independent, so they can be applied either or both.
 
 
 I tested them on my machine and the results as follows. It shows that
 direct-io and gather-write is the best choice when writeback-cache is off.
 Are these two patches worth trying if they are used together?
 
 
              | writeback | fsync= | fdata | open_ | fsync_ | open_
 patch        | cache     |  false |  sync |  sync | direct | direct
 -------------+-----------+--------+-------+-------+--------+--------
 direct io    | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2
 direct io    | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5
 gather-write | off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A)
 both         | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2
 
 - 20runs * pgbench -s 100 -c 50 -t 200
- with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
 - using 2 ATA disks:
- hda(reiserfs) includes system and wal.
- hdc(jfs) includes database files. writeback-cache is always on.
 
 ---
 ITAGAKI Takahiro
 NTT Cyber Space Laboratories
 

[ Attachment, skipping... ]

[ Attachment, skipping... ]

 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-28 Thread ITAGAKI Takahiro
Tom Lane [EMAIL PROTECTED] wrote:

 Yeah, this is about what I was afraid of: if you're actually fsyncing
 then you get at best one commit per disk revolution, and the negotiation
 with the OS is down in the noise.

If we disable the writeback cache and use open_sync, the per-page writing
behavior of the WAL module shows up as poor results. O_DIRECT is similar
to O_DSYNC (at least on Linux), so its benefit disappears behind the slow
disk revolution.

In the current source, WAL is written as:
for (i = 0; i < N; i++) { write(buffers[i], BLCKSZ); }
Is this intentional? Can we rewrite it as follows?
   write(buffers[0], N * BLCKSZ);

In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
These two patches are independent, so either or both can be applied.
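
The two write patterns above can be compared with a small sketch. This is an illustration in Python, not the xlog.c code or the xlog.gw.diff patch: BLCKSZ = 8192, the helper names and the scratch files are all assumptions made for the example. The pages are joined here only because Python byte strings are separate objects; the single write() in the message above implies that the WAL buffers already sit next to each other in memory.

```python
import os
import tempfile

BLCKSZ = 8192  # assumed page size; 8 kB is the PostgreSQL default


def write_per_page(fd, pages):
    """Issue one write() per page, like the loop quoted above."""
    calls = 0
    for page in pages:
        os.write(fd, page)
        calls += 1
    return calls


def write_gathered(fd, pages):
    """Hand all pages to the kernel in as few write() calls as possible."""
    data = memoryview(b"".join(pages))  # make the pages contiguous
    calls = 0
    while data:
        data = data[os.write(fd, data):]  # loops again only on a short write
        calls += 1
    return calls


def demo(n_pages=4):
    """Write the same pages both ways; return (calls, calls, identical?)."""
    pages = [bytes([i]) * BLCKSZ for i in range(n_pages)]
    results = []
    contents = []
    for writer in (write_per_page, write_gathered):
        fd, path = tempfile.mkstemp()
        try:
            results.append(writer(fd, pages))
            os.fsync(fd)  # one flush after all pages are written
        finally:
            os.close(fd)
        with open(path, "rb") as f:
            contents.append(f.read())
        os.unlink(path)
    return results[0], results[1], contents[0] == contents[1]


if __name__ == "__main__":
    print(demo())
```

Both variants leave the same bytes on disk; only the number of system calls differs, which matters most when every write is synchronous (open_sync or O_DIRECT).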


I tested them on my machine, and the results are as follows. They show that
combining direct-io and gather-write is the best choice when writeback-cache is off.
Are these two patches worth trying if they are used together?


             | writeback | fsync= | fdata | open_ | fsync_ | open_
patch        | cache     |  false |  sync |  sync | direct | direct
-------------+-----------+--------+-------+-------+--------+--------
direct io    | off       |  124.2 | 105.7 |  48.3 |   48.3 |  48.2
direct io    | on        |  129.1 | 112.3 | 114.1 |  142.9 | 144.5
gather-write | off       |  124.3 | 108.7 | 105.4 |  (N/A) | (N/A)
both         | off       |  131.5 | 115.5 | 114.4 |  145.4 | 145.2

- 20runs * pgbench -s 100 -c 50 -t 200
   - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
- using 2 ATA disks:
   - hda(reiserfs) includes system and wal.
   - hdc(jfs) includes database files. writeback-cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories



xlog.dio.diff
Description: Binary data


xlog.gw.diff
Description: Binary data


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-24 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 ... So I'll post the new results:

 checkpoint_ | writeback |           |                |               |               |
 segments    | cache     | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
 ------------+-----------+-----------+----------------+---------------+---------------+--------------
 [3]   3     | off       |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

Yeah, this is about what I was afraid of: if you're actually fsyncing
then you get at best one commit per disk revolution, and the negotiation
with the OS is down in the noise.
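
The "one commit per disk revolution" ceiling can be put into numbers with a short calculation. This is only a back-of-the-envelope sketch: the function name is made up for the example, and the thread does not state the rotational speed of the drives that were tested.

```python
def max_commits_per_second(rpm, commits_per_revolution=1):
    """Upper bound on commits/s when each commit waits for the platter."""
    revolutions_per_second = rpm / 60.0
    return revolutions_per_second * commits_per_revolution


if __name__ == "__main__":
    for rpm in (5400, 7200, 10000, 15000):
        limit = max_commits_per_second(rpm)
        print(f"{rpm:>6} RPM -> at most {limit:.0f} commits/s")
```

A 7200 RPM drive turns 120 times per second, so with the write cache off a single stream of serialized commits cannot be expected to exceed roughly 120 per second, whichever sync method issues the write.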

At this point I'm inclined to reject the patch on the grounds that it
adds complexity and portability issues, without actually buying any
useful performance improvement.  The write-cache-on numbers are not
going to be interesting to any serious user :-(

regards, tom lane


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-24 Thread Jim C. Nasby
On Fri, Jun 24, 2005 at 09:37:23AM -0400, Tom Lane wrote:
 ITAGAKI Takahiro [EMAIL PROTECTED] writes:
  ... So I'll post the new results:
 
  checkpoint_ | writeback |           |                |               |               |
  segments    | cache     | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
  ------------+-----------+-----------+----------------+---------------+---------------+--------------
  [3]   3     | off       |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)
 
 Yeah, this is about what I was afraid of: if you're actually fsyncing
 then you get at best one commit per disk revolution, and the negotiation
 with the OS is down in the noise.
 
 At this point I'm inclined to reject the patch on the grounds that it
 adds complexity and portability issues, without actually buying any
 useful performance improvement.  The write-cache-on numbers are not
 going to be interesting to any serious user :-(

Is there anyone with a battery-backed RAID controller that could run
these tests? I suspect that in that case the differences might be closer
to cases [1] or [2] than to [3], which would make the patch much more valuable.

Josh, is this something that could be done in the performance lab?
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-24 Thread Josh Berkus
Jim,

 Josh, is this something that could be done in the performance lab?

That's the idea.   Sadly, OSDL's hardware has been having critical failures of 
late (I'm still trying to get test results on the checkpointing thing) and 
the GreenPlum machines aren't up yet.

I need to contact those folks in Brazil ...

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-23 Thread Jim C. Nasby
On Wed, Jun 22, 2005 at 03:50:04PM -0400, Tom Lane wrote:
 The reason I question automatic is that you really want to test each
 drive being used, if the system has more than one; but Postgres has no
 idea what the actual hardware layout is, and so no good way to know what
 needs to be tested.

Would testing in the WAL directory be sufficient? Or at least better
than nothing? Of course we could test in the database directories as
well, but you never know if stuff's been symlinked elsewhere... err, we
can test for that, no?

In any case, it seems like it'd be good to try to test and throw a
warning if the drive appears to be caching or if we think the test might
not cover everything (ie symlinks in the data directory).
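
A check for symlinks under the data directory could look roughly like the sketch below. It is written in Python purely for illustration; nothing like it exists in the backend, and the function names and the warning text are assumptions.

```python
import os
import sys


def find_symlinks(data_dir):
    """Return every path under data_dir that is a symbolic link."""
    found = []
    # followlinks=False: links to directories are listed but not descended into
    for root, dirs, files in os.walk(data_dir, followlinks=False):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                found.append(path)
    return sorted(found)


def warn_about_layout(data_dir, out=sys.stderr):
    """Print one warning per symlink; return how many were found."""
    links = find_symlinks(data_dir)
    for link in links:
        target = os.path.realpath(link)
        print(f"WARNING: {link} -> {target} may live on an untested drive",
              file=out)
    return len(links)
```

A symlink does not prove that a different drive is involved, so this can only tell the admin where a fsync test in the WAL directory might not cover everything.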
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-23 Thread Douglas McNaught
Jim C. Nasby [EMAIL PROTECTED] writes:

 Would testing in the WAL directory be sufficient? Or at least better
 than nothing? Of course we could test in the database directories as
 well, but you never know if stuff's been symlinked elsewhere... err, we
 can test for that, no?

 In any case, it seems like it'd be good to try to test and throw a
 warning if the drive appears to be caching or if we think the test might
 not cover everything (ie symlinks in the data directory).

I think it would make more sense to write the test as a separate
utility program--then the sysadmin can check the disks he cares
about.  I don't personally see the need to burden the backend with
this.

-Doug


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-23 Thread Bruce Momjian
Tom Lane wrote:
 Greg Stark [EMAIL PROTECTED] writes:
  Tom Lane [EMAIL PROTECTED] writes:
  Unfortunately, I cannot believe these numbers --- the near equality of
  fsync off and fsync on means there is something very wrong with the
  measurements.  What I suspect is that your ATA drives are doing write
  caching and thus the fsyncs are not really waiting for I/O at all.
 
  I wonder whether it would make sense to have an automatic test for this
  problem. I suspect there are lots of installations out there whose admins
  don't realize that their hardware is doing this to them.
 
 Not sure about automatic, but a simple little test program to measure
 the speed of rewriting/fsyncing a small test file would surely be a nice
 thing to have.
 
 The reason I question automatic is that you really want to test each
 drive being used, if the system has more than one; but Postgres has no
 idea what the actual hardware layout is, and so no good way to know what
 needs to be tested.

Some folks have battery-backed cached controllers so they would appear
as not handling fsync when in fact they do.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-23 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 Tom Lane wrote:
 The reason I question automatic is that you really want to test each
 drive being used, if the system has more than one; but Postgres has no
 idea what the actual hardware layout is, and so no good way to know what
 needs to be tested.

 Some folks have battery-backed cached controllers so they would appear
 as not handling fsync when in fact they do.

Right, so something like refusing to start if we think fsync doesn't
work is probably not a hot idea.  (Unless you want to provide a GUC
variable to override it...)

regards, tom lane


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-23 Thread Bruce Momjian
Tom Lane wrote:
 Gavin Sherry [EMAIL PROTECTED] writes:
  Curt Sampson [EMAIL PROTECTED] writes:
  But is it really a problem? I somewhere got the impression that some
  drives, on power failure, will be able to keep going for long enough to
  write out the cache and park the heads anyway. If so, the drive is still
  guaranteeing the write.
 
  I've seen discussion about disks behaving this way. There's no magic:
  they're battery backed.
 
 Oh, sure, then it's easy ;-)
 
 The bottom line here seems to be the same as always: you can't run an
 industrial strength database on piece-of-junk consumer grade hardware.
 Our problem is that because the software is free, people expect to run
 it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
 they blame us when they don't get the same results as the guy running
 Oracle on million-dollar triply-redundant server hardware.  Oh well.

At least we have an FAQ on this:

3.7) What computer hardware should I use?

Because PC hardware is mostly compatible, people tend to believe that
all PC hardware is of equal quality.  It is not.  ECC RAM, SCSI, and
quality motherboards are more reliable and have better performance than
less expensive hardware.  PostgreSQL will run on almost any hardware,
but if reliability and performance are important it is wise to
research your hardware options thoroughly.  Our email lists can be used
to discuss hardware options and tradeoffs.



-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-23 Thread ITAGAKI Takahiro
Tom Lane [EMAIL PROTECTED] wrote:

 Unfortunately, I cannot believe these numbers --- the near equality of
 fsync off and fsync on means there is something very wrong with the
 measurements.  What I suspect is that your ATA drives are doing write
 caching and thus the fsyncs are not really waiting for I/O at all.

I think direct io and writeback-cache should be considered separate issues.
I guess that direct io keeps the OS from caching WAL files, so it can
use more memory to cache data files.
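
For readers who have not used it, opening a file with O_DIRECT asks the kernel to bypass its page cache for that file, and in exchange the buffer and the length must be suitably aligned. The sketch below shows the idea in Python; it is not taken from xlog.dio.diff. The function name, the 8192-byte block size and the fallback to a normal open when O_DIRECT is missing or refused are all assumptions made for the example.

```python
import mmap
import os

BLCKSZ = 8192  # assumed block size for the example


def write_block(path, payload, block_size=BLCKSZ):
    """Write one zero-padded block; return True if O_DIRECT was used."""
    if len(payload) > block_size:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, block_size)  # anonymous map: page-aligned, zeroed
    try:
        buf[:len(payload)] = payload
        base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
        direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
        attempts = [direct, 0] if direct else [0]
        for extra in attempts:
            try:
                fd = os.open(path, base | extra, 0o600)
            except OSError:
                if not extra:
                    raise
                continue  # filesystem refused O_DIRECT; try a plain open
            try:
                os.write(fd, buf)
                os.fsync(fd)
                return bool(extra)
            except OSError:
                if not extra:
                    raise  # plain write failed: nothing left to try
            finally:
                os.close(fd)
    finally:
        buf.close()
```

The anonymous mmap is used only because it gives a page-aligned buffer without extra libraries; C code would align its buffers by other means.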

In my previous test, I had enabled the writeback cache on my drives
for performance. But I understand from the discussion that the cache
should be disabled for reliable writes.
Also, my checkpoint_segments setting might have been too large compared with
the default. So I'll post the new results:

checkpoint_ | writeback |           |                |               |               |
segments    | cache     | open_sync | fsync=false    | O_DIRECT only | fsync_direct  | open_direct
------------+-----------+-----------+----------------+---------------+---------------+--------------
[1]  48     | on        | 109.3 tps | 125.1(+ 11.4%) | 157.3(+44.0%) | 160.4(+46.8%) | 161.1(+47.5%)
[2]   3     | on        | 102.5 tps | 136.3(+ 33.0%) | 117.6(+14.7%) |               |
[3]   3     | off       |  38.2 tps | 138.8(+263.5%) |  38.6(+ 1.2%) |  38.5(+ 0.9%) |  38.5(+ 0.9%)

- 30runs * pgbench -s 100 -c 10 -t 1000
- using 2 ATA disks:
   - hda(reiserfs) includes system and wal. writeback-cache is on at [1][2] and off at [3].
   - hdc(jfs) includes database files. writeback-cache is always on.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories




Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Greg Stark
Tom Lane [EMAIL PROTECTED] writes:

 Unfortunately, I cannot believe these numbers --- the near equality of
 fsync off and fsync on means there is something very wrong with the
 measurements.  What I suspect is that your ATA drives are doing write
 caching and thus the fsyncs are not really waiting for I/O at all.

I wonder whether it would make sense to have an automatic test for this
problem. I suspect there are lots of installations out there whose admins
don't realize that their hardware is doing this to them.

It shouldn't be too hard to test a few hundred or even a few thousand fsyncs
and calculate the seek time. If it implies a rotational speed over 15kRPM then
you know the drive is lying and the data storage is unreliable.
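
The arithmetic behind that check is short enough to sketch. This is a Python illustration under the simplifying assumption that every honest fsync waits for one full platter revolution; the function names are invented, and the 15000 RPM limit is the figure from the paragraph above.

```python
def implied_rpm(fsyncs, elapsed_seconds):
    """Rotational speed implied if each fsync waited one full revolution."""
    return fsyncs / elapsed_seconds * 60.0


def looks_like_write_caching(fsyncs, elapsed_seconds, max_plausible_rpm=15000):
    """True if the fsync rate is faster than any plausible platter speed."""
    return implied_rpm(fsyncs, elapsed_seconds) > max_plausible_rpm


if __name__ == "__main__":
    # 1200 fsyncs in 10 s is what a 7200 RPM drive could honestly deliver
    print(implied_rpm(1200, 10.0), looks_like_write_caching(1200, 10.0))
    # 1000 fsyncs in half a second would need an impossibly fast platter
    print(implied_rpm(1000, 0.5), looks_like_write_caching(1000, 0.5))
```

A battery-backed controller cache also trips this test while still being safe, so the result is a hint for the admin rather than proof of unreliable storage.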

-- 
greg



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 Unfortunately, I cannot believe these numbers --- the near equality of
 fsync off and fsync on means there is something very wrong with the
 measurements.  What I suspect is that your ATA drives are doing write
 caching and thus the fsyncs are not really waiting for I/O at all.

 I wonder whether it would make sense to have an automatic test for this
 problem. I suspect there are lots of installations out there whose admins
 don't realize that their hardware is doing this to them.

Not sure about automatic, but a simple little test program to measure
the speed of rewriting/fsyncing a small test file would surely be a nice
thing to have.

The reason I question automatic is that you really want to test each
drive being used, if the system has more than one; but Postgres has no
idea what the actual hardware layout is, and so no good way to know what
needs to be tested.
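
Such a test program might look like the sketch below. It is written in Python only to show the shape of the measurement and is not an existing PostgreSQL tool; the function name, the scratch file name and the default counts are assumptions. The directory argument is what lets the admin point it at each drive in turn.

```python
import os
import time


def fsyncs_per_second(directory, count=200, size=8192):
    """Rewrite and fsync a small file `count` times; return the rate."""
    path = os.path.join(directory, "fsync_test.tmp")  # hypothetical name
    block = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same block each time
            os.write(fd, block)
            os.fsync(fd)                  # wait until the OS reports it durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / max(elapsed, 1e-9)


if __name__ == "__main__":
    import sys
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(f"{fsyncs_per_second(target):.1f} fsyncs/s in {target}")
```

A rate far above the drive's revolutions per second suggests that some cache between the program and the platter is acknowledging writes early.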

regards, tom lane


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Curt Sampson

On Thu, 22 Jun 2005, Greg Stark wrote:


Tom Lane [EMAIL PROTECTED] writes:


Unfortunately, I cannot believe these numbers --- the near equality of
fsync off and fsync on means there is something very wrong with the
measurements.  What I suspect is that your ATA drives are doing write
caching and thus the fsyncs are not really waiting for I/O at all.


I wonder whether it would make sense to have an automatic test for this
problem. I suspect there are lots of installations out there whose admins
don't realize that their hardware is doing this to them.


But is it really a problem? I somewhere got the impression that some
drives, on power failure, will be able to keep going for long enough to
write out the cache and park the heads anyway. If so, the drive is still
guaranteeing the write.

But regardless, perhaps we can add some stuff to the various OSes'
startup scripts that could help with this. For example, in NetBSD you
can run "dkctl <device> setcache r" for most any disk device (certainly all
SCSI and ATA) to enable the read cache and disable the write cache.

cjs
--
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Tom Lane
Curt Sampson [EMAIL PROTECTED] writes:
 But regardless, perhaps we can add some stuff to the various OSes'
 startup scripts that could help with this. For example, in NetBSD you
 can dkctl device setcache r for most any disk device (certainly all
 SCSI and ATA) to enable the read cache and disable the write cache.

[ shudder ]  I can see the complaints now: Merely starting up Postgres
cut my overall system performance by a factor of 10!  I wasn't even
using it!!  What a piece of junk!!!  I can hardly think of a better
way to drive away people with a marginal interest in the database...

This can *not* be default behavior, and unfortunately that limits its
value quite a lot.

regards, tom lane


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Curt Sampson

On Wed, 22 Jun 2005, Tom Lane wrote:


[ shudder ]  I can see the complaints now: Merely starting up Postgres
cut my overall system performance by a factor of 10!


Yeah, quite the scenario.


This can *not* be default behavior, and unfortunately that limits its
value quite a lot.


Indeed. Maybe it's best just to document this stuff for the various
OSes, and let the admins deal with configuring their machines.

But you know, it might be a reasonable option switch, or something.

cjs
--
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Tom Lane
[ on the other point... ]

Curt Sampson [EMAIL PROTECTED] writes:
 But is it really a problem? I somewhere got the impression that some
 drives, on power failure, will be able to keep going for long enough to
 write out the cache and park the heads anyway. If so, the drive is still
 guaranteeing the write.

If the drives worked that way, we'd not be seeing any problem, but we do
see problems.  Without having a whole lot of data to back it up, I would
think that keeping the platter spinning is no problem (sheer rotational
inertia) but seeking to a lot of new tracks to write randomly-positioned
dirty sectors would require significant energy that just ain't there
once the power drops.  I seem to recall reading that the seek actuators
eat the largest share of power in a running drive...

regards, tom lane


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Gavin Sherry
On Thu, 23 Jun 2005, Tom Lane wrote:

 [ on the other point... ]

 Curt Sampson [EMAIL PROTECTED] writes:
  But is it really a problem? I somewhere got the impression that some
  drives, on power failure, will be able to keep going for long enough to
  write out the cache and park the heads anyway. If so, the drive is still
  guaranteeing the write.

 If the drives worked that way, we'd not be seeing any problem, but we do
 see problems.  Without having a whole lot of data to back it up, I would
 think that keeping the platter spinning is no problem (sheer rotational
 inertia) but seeking to a lot of new tracks to write randomly-positioned
 dirty sectors would require significant energy that just ain't there
 once the power drops.  I seem to recall reading that the seek actuators
 eat the largest share of power in a running drive...

I've seen discussion about disks behaving this way. There's no magic:
they're battery backed.

Thanks,

Gavin


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Gregory Maxwell
On 6/23/05, Gavin Sherry [EMAIL PROTECTED] wrote:

  inertia) but seeking to a lot of new tracks to write randomly-positioned
  dirty sectors would require significant energy that just ain't there
  once the power drops.  I seem to recall reading that the seek actuators
  eat the largest share of power in a running drive...
 
 I've seen discussion about disks behaving this way. There's no magic:
 they're battery backed.

Nah this isn't always the case, for example some of the IBM deskstars
had a few tracks at the start of the disk reserved.. if the power
failed the head retracted all the way and used the rotational energy
to power it long enough to write out the cache..  At start the drive
would read it back in and finish flushing it.

 unfortunately firmware bugs made it not always wait until the
head returned to the start to begin writing...

I'm not sure what other drives do this (er, well do it correctly :) ).


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Tom Lane
Gavin Sherry [EMAIL PROTECTED] writes:
 Curt Sampson [EMAIL PROTECTED] writes:
 But is it really a problem? I somewhere got the impression that some
 drives, on power failure, will be able to keep going for long enough to
 write out the cache and park the heads anyway. If so, the drive is still
 guaranteeing the write.

 I've seen discussion about disks behaving this way. There's no magic:
 they're battery backed.

Oh, sure, then it's easy ;-)

The bottom line here seems to be the same as always: you can't run an
industrial strength database on piece-of-junk consumer grade hardware.
Our problem is that because the software is free, people expect to run
it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
they blame us when they don't get the same results as the guy running
Oracle on million-dollar triply-redundant server hardware.  Oh well.

regards, tom lane


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Gavin Sherry
On Thu, 23 Jun 2005, Tom Lane wrote:

 Gavin Sherry [EMAIL PROTECTED] writes:
  Curt Sampson [EMAIL PROTECTED] writes:
  But is it really a problem? I somewhere got the impression that some
  drives, on power failure, will be able to keep going for long enough to
  write out the cache and park the heads anyway. If so, the drive is still
  guaranteeing the write.

  I've seen discussion about disks behaving this way. There's no magic:
  they're battery backed.

 Oh, sure, then it's easy ;-)

 The bottom line here seems to be the same as always: you can't run an
 industrial strength database on piece-of-junk consumer grade hardware.
 Our problem is that because the software is free, people expect to run
 it on bottom-of-the-line Joe Bob's Bait And PC Shack hardware, and then
 they blame us when they don't get the same results as the guy running
 Oracle on million-dollar triply-redundant server hardware.  Oh well.

If you ever need a second job, I recommend stand up comedy :-).

Gavin


Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-22 Thread Curt Sampson

On Thu, 23 Jun 2005, Tom Lane wrote:


The bottom line here seems to be the same as always: you can't run an
industrial strength database on piece-of-junk consumer grade hardware.


Sure you can, though it may take several bits of piece-of-junk
consumer-grade hardware. It's far more about how you set up your system
and implement recovery policies than it is about hardware.

I ran an ISP back in the '90s on old PC junk, and we had far better
uptime than most of our competitors running on expensive Sun gear. One
ISP was completely out for half a day because the tech. guy bent and
broke a hot-swappable circuit board while installing it, bringing down
the entire machine. (Pretty dumb of them to be running everything on a
single, irreplaceable high-availability system.)


...they blame us when they don't get the same results as the guy
running Oracle on...


Now that phrase irritates me a bit. I've been using all this stuff for
a long time (Postgres on and off since QUEL, before SQL was dropped
in instead) and at this point, for the (perhaps slim) majority of
applications, I would say that PostgreSQL is a better database than
Oracle. It requires much, much less effort to get a system and its test
framework up and running under PostgreSQL than it does under Oracle,
PostgreSQL has far fewer stupid limitations, and in other areas, such
as performance, it competes reasonably well in a lot of cases. It's a
pretty impressive piece of work, thanks in large part to efforts put in
over the last few years.

cjs
--
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-21 Thread Tom Lane
ITAGAKI Takahiro [EMAIL PROTECTED] writes:
 I tested two combinations,
   - fsync_direct: O_DIRECT+fsync()
   - open_direct: O_DIRECT+O_SYNC
 to compare them with O_DIRECT on my linux machine.
 The pgbench results still show a performance win:

 scale | DBsize | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
 ------+--------+-----------+---------------+---------------+---------------+--------------
    10 |  150MB | 252.6 tps | 263.5(+ 4.3%) | 253.4(+ 0.3%) | 253.6(+ 0.4%) | 253.3(+ 0.3%)
   100 |  1.5GB | 102.7 tps | 117.8(+14.7%) | 147.6(+43.7%) | 148.9(+45.0%) | 150.8(+46.8%)
 60runs * pgbench -c 10 -t 1000
 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

Unfortunately, I cannot believe these numbers --- the near equality of
fsync off and fsync on means there is something very wrong with the
measurements.  What I suspect is that your ATA drives are doing write
caching and thus the fsyncs are not really waiting for I/O at all.
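
One rough way to check for this, offered only as a sketch: time a series of small write-then-fsync cycles from a single process and compare the rate with what a spinning disk could deliver if each fsync really waited for the platter, roughly one commit per revolution. The function names, the 1.5x slack factor, and the rpm value passed in are all assumptions for illustration; the thread does not state the drives' rotational speed.

```python
import os
import tempfile
import time


def max_commits_per_sec(rpm):
    """Approximate ceiling for one writer that waits for the platter per commit."""
    return rpm / 60.0


def fsync_rate(n=200, size=8192, directory=None):
    """Measure sequential write+fsync cycles per second in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        block = b"\0" * size
        start = time.perf_counter()
        for _ in range(n):
            os.write(fd, block)
            os.fsync(fd)  # ask the OS to flush this file to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return n / elapsed


def write_cache_suspected(measured_rate, rpm, slack=1.5):
    """Heuristic: True when the rate is implausibly high for a spinning disk."""
    return measured_rate > slack * max_commits_per_sec(rpm)
```

For example, `write_cache_suspected(fsync_rate(directory="/path/to/pg_xlog"), rpm=7200)` (the path is a placeholder) returning True would suggest that the fsyncs are returning before the data reaches the platter. It is only a heuristic: a battery-backed controller cache produces the same signature legitimately.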

regards, tom lane



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-21 Thread Josh Berkus
Takahiro,

 scale | DBsize | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
 ------+--------+-----------+---------------+---------------+---------------+--------------
    10 |  150MB | 252.6 tps | 263.5(+ 4.3%) | 253.4(+ 0.3%) | 253.6(+ 0.4%) | 253.3(+ 0.3%)
   100 |  1.5GB | 102.7 tps | 117.8(+14.7%) | 147.6(+43.7%) | 148.9(+45.0%) | 150.8(+46.8%)
 60runs * pgbench -c 10 -t 1000
 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

This looks pretty good.  I'd like to try it out on some of our tests.
Will get back to you on this, but it looks to me like the O_DIRECT
results are good enough to consider accepting the patch.

What filesystem and mount options did you use for this test?

 - Are both fsync_direct and open_direct necessary?
 MySQL seems to use only O_DIRECT+fsync() combination.

MySQL doesn't support as many operating systems as we do.   What OSes and 
versions will support O_DIRECT?


-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] [PATCHES] O_DIRECT for WAL writes

2005-06-20 Thread ITAGAKI Takahiro
Hi all,
O_DIRECT for WAL writes was discussed at
http://archives.postgresql.org/pgsql-patches/2005-06/msg00064.php
but there are some items I would still like to discuss, so I am
re-posting it to HACKERS.


Bruce Momjian pgman@candle.pha.pa.us wrote:

 I think the conclusion from the discussion is that O_DIRECT is in
 addition to the sync method, rather than in place of it, because
 O_DIRECT doesn't have the same media write guarantees as fsync().  Would
 you update the patch to do so and see if there is a performance win?

I tested two combinations,
  - fsync_direct: O_DIRECT+fsync()
  - open_direct: O_DIRECT+O_SYNC
to compare them with O_DIRECT on my linux machine.
The pgbench results still show a performance win:

scale | DBsize | open_sync | fsync=false   | O_DIRECT only | fsync_direct  | open_direct
------+--------+-----------+---------------+---------------+---------------+--------------
   10 |  150MB | 252.6 tps | 263.5(+ 4.3%) | 253.4(+ 0.3%) | 253.6(+ 0.4%) | 253.3(+ 0.3%)
  100 |  1.5GB | 102.7 tps | 117.8(+14.7%) | 147.6(+43.7%) | 148.9(+45.0%) | 150.8(+46.8%)
60runs * pgbench -c 10 -t 1000
on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8

O_DIRECT, fsync_direct and open_direct show the same performance trend.
There was a win at scale=100, but none at scale=10, which is a fully
in-memory benchmark.
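
For readers comparing the columns, the tested methods differ only in the flags passed to open() and in whether fsync() is called after each write. A minimal sketch of that mapping follows; the table and function names are illustrative and not taken from the patch, and `getattr` is used because O_DIRECT is not defined on every platform.

```python
import os

# O_DIRECT and O_SYNC are platform-dependent; fall back to 0 (no effect) if absent.
O_DIRECT = getattr(os, "O_DIRECT", 0)
O_SYNC = getattr(os, "O_SYNC", 0)

# method name -> (extra open() flags, call fsync() after each write?)
# The keys follow the benchmark columns; "o_direct_only" is a made-up label.
SYNC_METHODS = {
    "open_sync": (O_SYNC, False),
    "o_direct_only": (O_DIRECT, False),
    "fsync_direct": (O_DIRECT, True),
    "open_direct": (O_DIRECT | O_SYNC, False),
}


def open_wal(path, method):
    """Open `path` for writing with the flags for `method`.

    Returns (fd, needs_fsync). With O_DIRECT set, later writes generally
    need aligned buffers and sizes, which this sketch does not handle.
    """
    extra, needs_fsync = SYNC_METHODS[method]
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra, 0o600)
    return fd, needs_fsync
```

A caller would write through the returned descriptor and call `os.fsync(fd)` only when `needs_fsync` is true.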

The following items still need discussion:
- Are their names appropriate?
  Simplify to 'direct'?
- Are both fsync_direct and open_direct necessary?
  MySQL seems to use only the O_DIRECT+fsync() combination.
- Is it ok to set the dio buffer alignment to BLCKSZ?
  This is a simple way to choose an alignment that suits many environments.
  If it is not enough, BLCKSZ would also be a problem for direct io.
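
To make the alignment question concrete, here is a small sketch of the arithmetic involved, assuming BLCKSZ is 8192 (PostgreSQL's default block size). The helper names are hypothetical. The buffer comes from an anonymous mmap, which starts on a page boundary; that is not necessarily a BLCKSZ boundary, so this only approximates what the patch would do with its own allocator.

```python
import mmap

BLCKSZ = 8192  # assumed: PostgreSQL's default block size


def round_up(nbytes, align=BLCKSZ):
    """Round nbytes up to the next multiple of align."""
    return (nbytes + align - 1) // align * align


def is_aligned(value, align=BLCKSZ):
    """True if an offset, length, or address is a multiple of align."""
    return value % align == 0


def aligned_buffer(nbytes, align=BLCKSZ):
    """Return an anonymous mmap whose length is a multiple of align.

    The mapping starts on a page boundary (page-aligned, not guaranteed
    to be BLCKSZ-aligned).
    """
    return mmap.mmap(-1, round_up(nbytes, align))
```

Direct io generally wants the buffer address, the file offset, and the transfer length all aligned; if every WAL write is a whole number of BLCKSZ blocks at a BLCKSZ offset, the last two follow automatically.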



BTW, IMHO the major benefit of direct io is saving memory. O_DIRECT gives
a hint that the OS should not cache WAL files. Without direct io, the OS
might make an effort to cache WAL files, which will never be read again,
and might discard cached data file pages to do so.

---
ITAGAKI Takahiro
NTT Cyber Space Laboratories


