Re: [PERFORM] vacuum performance

2004-03-18 Thread Tom Lane
Alan Stange <[EMAIL PROTECTED]> writes:
> I do a truss on the process and see the output below looping over and 
> over.   Note the constant opening and closing of the file 42064889.3.
> Why the open/close cycle as opposed to caching the file descriptor 
> somewhere?

This is probably a "blind write".  We've gotten rid of those for 7.5.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[PERFORM] vacuum performance

2004-03-18 Thread Alan Stange
Hello all,

I have a question/observation about vacuum performance.  I'm running 
Solaris 9, pg 7.4.1.
The process in questions is doing a vacuum:

bash-2.05$ /usr/ucb/ps auxww | grep 4885
fiasco4885 19.1  3.7605896592920 ?O 19:29:44 91:38 postgres: 
fiasco fiasco [local] VACUUM

I do a truss on the process and see the output below looping over and 
over.   Note the constant opening and closing of the file 42064889.3.

Why the open/close cycle as opposed to caching the file descriptor 
somewhere?

If PG really does need to loop like this, it should be much faster to 
set the cwd and then open without the path in the file name.  You're 
forcing the kernel to do a lot of work walking the path, checking for 
nfs mounts, symlinks, etc.

Thanks!

-- Alan

open64("/export/nst1/fi/pg/data1/base/91488/42064889.3", O_RDWR) = 47
llseek(47, 0x18F6E000, SEEK_SET)= 0x18F6E000
write(47, "\0\0\0 zA9A3D9E8\0\0\0 "".., 8192)   = 8192
close(47)   = 0
read(29, "\0\0\0 }ED WF1B0\0\0\0 $".., 8192)= 8192
open64("/export/nst1/fi/pg/data1/base/91488/42064889.3", O_RDWR) = 47
llseek(47, 0x18F78000, SEEK_SET)= 0x18F78000
write(47, "\0\0\0 zA9AC 090\0\0\0 "".., 8192)   = 8192
close(47)   = 0
llseek(43, 0x26202000, SEEK_SET)= 0x26202000
read(43, "\0\0\084 EC9FC P\0\0\0 )".., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
open64("/export/nst1/fi/pg/data1/base/91488/42064889.3", O_RDWR) = 47
llseek(47, 0x18F62000, SEEK_SET)= 0x18F62000
write(47, "\0\0\0 zA9C2\bB8\0\0\0 "".., 8192)   = 8192
close(47)   = 0
read(29, "\0\0\0 }ED X1210\0\0\0 $".., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
open64("/export/nst1/fi/pg/data1/base/91488/42064889.3", O_RDWR) = 47
llseek(47, 0x18018000, SEEK_SET)= 0x18018000
write(47, "\0\0\0 zA997ADB0\0\0\0 "".., 8192)   = 8192
close(47)   = 0
llseek(43, 0x2620, SEEK_SET)= 0x2620
read(43, "\0\0\084 EC4F5E8\0\0\0 )".., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
llseek(13, 13918208, SEEK_SET)  = 13918208
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
open64("/export/nst1/fi/pg/data1/base/91488/42064889.3", O_RDWR) = 47
llseek(47, 0x18F52000, SEEK_SET)= 0x18F52000
write(47, "\0\0\0 zABE7 V10\0\0\0 "".., 8192)   = 8192
close(47)   = 0
semop(46, 0xFFBFC5D0, 1)= 0
read(29, "\0\0\0 }ED X 2 p\0\0\0 $".., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
llseek(43, 0x270DA000, SEEK_SET)= 0x270DA000
write(43, "\0\0\087A2E8 #B8\0\0\0 )".., 8192)   = 8192
llseek(43, 0x261FE000, SEEK_SET)= 0x261FE000
read(43, "\0\0\084 EC498\0\0\0\0 )".., 8192)= 8192
poll(0xFFBFC100, 0, 10) = 0
open64("/export/nst1/fi/pg/data1/base/91488/42064889.3", O_RDWR) = 47
llseek(47, 0x1804A000, SEEK_SET)= 0x1804A000
write(47, "\0\0\0 zAA0F8DE0\0\0\0 "".., 8192)   = 8192
close(47)   = 0
read(29, "\0\0\0 }ED X RD0\0\0\0 $".., 8192)= 8192
semop(52, 0xFFBFC5E0, 1)= 0
semop(52, 0xFFBFC640, 1)= 0
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
write(13, "D0 Z\001\0\0\0 )\0\0\087".., 8192)   = 8192
open64("/export/nst1/fi/pg/data1/base/91488/42064889.3", O_RDWR) = 47
---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian
Tom Lane wrote:
> > It really just shows whether the fsync fater the close has similar
> > timing to the one before the close.  That was the best way I could think
> > to test it.
> 
> Sure, but where's the "separate process" part?  What this seems to test
> is whether a single process can sync its own writes through a different
> file descriptor; which is interesting but by no means the only thing we
> need to be sure of if we want to make the bgwriter handle syncing.

I am not sure how to easily test if a separate process can do the same. 
I am sure it can be done, but for me it was enough to see that it works
in a single process.  Unix isn't very process-centered for I/O, so I
don't think it would make much of a difference.  Now, Win32, that might
be an issue.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] severe performance issue with planner (fwd)

2004-03-18 Thread Eric Brown
I also tried this (printf1 in irc suggested "folding" the joins) :

SELECT
w8.wid,
w8.variant,
w8.num_variants,
sum_text(w8.unicode) as unicodes,
sum_text(w8.pinyin) as pinyins
FROM
(words as w8 JOIN
(words as w7 JOIN
(words as w6 JOIN
(words as w5 JOIN
(words as w4 JOIN
(words as w3 JOIN
(words as w2 JOIN
(words as w0 JOIN words as w1
ON(w1.wid = w0.wid AND w1.variant = w0.variant AND w1.sequence = w0.sequence 
+ 1 AND w1.pinyin LIKE 'fu_'))
ON(w2.wid = w1.wid AND w2.variant = w1.variant AND w2.sequence = w1.sequence 
+ 1 AND w2.pinyin LIKE 'ji_'))
ON(w3.wid = w2.wid AND w3.variant = w2.variant AND w3.sequence = w2.sequence 
+ 1 AND w3.pinyin LIKE 'guan_'))
ON(w4.wid = w3.wid AND w4.variant = w3.variant AND w4.sequence = w3.sequence 
+ 1 AND w4.pinyin LIKE 'kai_'))
ON(w5.wid = w4.wid AND w5.variant = w4.variant AND w5.sequence = w4.sequence 
+ 1 AND w5.pinyin LIKE 'fang_'))
ON(w6.wid = w5.wid AND w6.variant = w5.variant AND w6.sequence = w5.sequence 
+ 1 AND w6.pinyin LIKE 'xi_'))
ON(w7.wid = w6.wid AND w7.variant = w6.variant AND w7.sequence = w6.sequence 
+ 1 AND w7.pinyin LIKE 'tong_'))
ON(w8.wid = w7.wid AND w8.variant = w7.variant))

WHERE
w0.wid > 0 AND
w0.pinyin = 'zheng4' AND
w0.def_exists = 't' AND
w0.sequence = 0
GROUP BY
w8.wid,
w8.variant,
w8.num_variants,
w8.page_order,
w0.sequence ,
w1.sequence ,
w2.sequence ,
w3.sequence ,
w4.sequence ,
w5.sequence ,
w6.sequence ,
w7.sequence
ORDER BY
w8.page_order;
this gets teh time down to 800ms (not too shabby..).. and as a prepared 
statement, it only takes 15ms!!! i am hopeful there is a way to totally 
bypass most of this overhead.. but i need more help :\

_
MSN Toolbar provides one-click access to Hotmail from any Web page – FREE 
download! http://clk.atdmt.com/AVE/go/onm00200413ave/direct/01/

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Kevin Brown
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Well, I wrote the program to allow testing.  I don't see a complex test
> > as being that much better than simple one.  We don't need accurate
> > numbers.  We just need to know if fsync or O_SYNC is faster.
> 
> Faster than what?  The thing everyone is trying to point out here is
> that it depends on context, and we have little faith that this test
> program creates a context similar to a live Postgres database.

Note, too, that the preferred method isn't likely to depend just on the
operating system, it's likely to depend also on the filesystem type
being used.

Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs,
and that's just off the top of my head.  I imagine the performance of
the various syncing methods will vary significantly between them.


It seems reasonable to me that decisions such as which sync method to
use should initially be made at installation time: have the test program
run on the target filesystem as part of the installation process, and
build the initial postgresql.conf based on the results.  You might even
be able to do some additional testing such as measuring the difference
between random block access and sequential access, and again feed the
results into the postgresql.conf file.  This is no substitute for
experience with the platform, but I expect it's likely to get you closer
to something optimal than doing nothing.  The only question, of course,
is whether or not it's worth going to the effort when it may or may not
gain you a whole lot.  Answering that is going to require some
experimentation with such an automatic configuration system.



-- 
Kevin Brown   [EMAIL PROTECTED]

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Well, I wrote the program to allow testing.  I don't see a complex test
> as being that much better than simple one.  We don't need accurate
> numbers.  We just need to know if fsync or O_SYNC is faster.

Faster than what?  The thing everyone is trying to point out here is
that it depends on context, and we have little faith that this test
program creates a context similar to a live Postgres database.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Tom Lane
Josh Berkus <[EMAIL PROTECTED]> writes:
> 1) This is an OSS project.   Why not just recruit a bunch of people on 
> PERFORMANCE and GENERAL to test the 4 different synch methods using real 
> databases?   No test like reality, I say 

I agree --- that is likely to yield *far* more useful results than
any standalone test program, for the purpose of finding out what
wal_sync_method to use in real databases.  However, there's a second
issue here: we would like to move sync/checkpoint responsibility into
the bgwriter, and that requires knowing whether it's valid to let one
process fsync on behalf of writes that were done by other processes.
That's got nothing to do with WAL sync performance.  I think that it
would be sensible to make a test program that focuses on this one
specific question.  (There has been some handwaving to the effect that
everybody knows this is safe on Unixen, but I question whether the
handwavers have seen the internals of HPUX or AIX for instance; and
besides we need to worry about Windows now.)

A third reason for having a simple test program is to confirm whether
your drives are syncing at all (cf. hdparm discussion).

> 2) Won't Jan's work on 7.5 memory and I/O management mean that we have to 
> re-evaluate synching anyway?

So far nothing's been done that touches WAL writing.  However, I am
thinking about making the bgwriter process take some of the load of
writing WAL buffers (right now it only writes data-file buffers).
And you're right, after that happens we will need to re-measure.
The open flags will probably become considerably more attractive than
they are now, if the bgwriter handles most non-commit writes of WAL.
(We might also think of letting the bgwriter use a different sync method
than the backends do.)

regards, tom lane

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian
Josh Berkus wrote:
> Tom, Bruce,
> 
> > My previous point about checking different fsync spacings corresponds to
> > different assumptions about average transaction size.  I think a useful
> > tool for determining wal_sync_method has got to be able to reflect that
> > range of possibilities.
> 
> Questions:
> 1) This is an OSS project.   Why not just recruit a bunch of people on 
> PERFORMANCE and GENERAL to test the 4 different synch methods using real 
> databases?   No test like reality, I say 

Well, I wrote the program to allow testing.  I don't see a complex test
as being that much better than simple one.  We don't need accurate
numbers.  We just need to know if fsync or O_SYNC is faster.

> 
> 2) Won't Jan's work on 7.5 memory and I/O management mean that we have to 
> re-evaluate synching anyway?

No, it should not change sync issues.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Josh Berkus
Tom, Bruce,

> My previous point about checking different fsync spacings corresponds to
> different assumptions about average transaction size.  I think a useful
> tool for determining wal_sync_method has got to be able to reflect that
> range of possibilities.

Questions:
1) This is an OSS project.   Why not just recruit a bunch of people on 
PERFORMANCE and GENERAL to test the 4 different synch methods using real 
databases?   No test like reality, I say 

2) Won't Jan's work on 7.5 memory and I/O management mean that we have to 
re-evaluate synching anyway?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian
Kurt Roeckx wrote:
> Here are my results on Linux 2.6.1 using cvs version 1.7.
> 
> Those times with > 20 seconds, you really hear the disk go crazy.
> 
> And I have the feeling something must be wrong.  Those results
> are reproducible.
> 

Wow, your O_SYNC times are great.  Where can I buy some?  :-)

Anyway, we do need to find a way to test this because obviously there is
huge platform variability.

---


> 
> Kurt
> 
> 
> Simple write timing:
> write0.139558
> 
> Compare fsync times on write() and non-write() descriptor:
> (If the times are similar, fsync() can sync data written
>  on a different descriptor.)
> write, fsync, close  8.249364
> write, close, fsync  8.356813
> 
> Compare one o_sync write to two:
> one 16k o_sync write28.487650
> two 8k o_sync writes 2.310304
> 
> Compare file sync methods with one 8k write:
> (o_dsync unavailable)
> open o_sync, write   1.010688
> write, fdatasync25.109604
> write, fsync,   26.051218
> 
> Compare file sync methods with 2 8k writes:
> (The fastest should be used for wal_sync_method)
> (o_dsync unavailable)
> open o_sync, write   2.212223
> write, fdatasync27.439907
> write, fsync,   27.772294
> 
> 
> ---(end of broadcast)---
> TIP 5: Have you checked our extensive FAQ?
> 
>http://www.postgresql.org/docs/faqs/FAQ.html
> 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian
Kurt Roeckx wrote:
> On Thu, Mar 18, 2004 at 02:22:10PM -0500, Bruce Momjian wrote:
> > 
> > OK, what better test do you suggest?  Right now, there has been no
> > testing of these.
> 
> I suggest you start by doing atleast preallocating a 16 MB file
> and do the tests on that, to atleast be somewhat simular to what
> WAL does.
> 
> I have no idea what the access pattern is for normal WAL
> operations or how many times it gets synched.  Does it only do
> f(data)sync() at commit time, or for every block it writes?
> 
> I think if you write more data you'll see more differences
> between O_(D)SYNC and f(data)sync().
> 
> I guess it can depend on if you have lots of small transactions,
> or more big ones.
> 
> Atleast try to make something that covers different access
> patterns.

OK, I preallocated 16mb.  New results:

Simple write timing:
write0.037900

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
write, fsync, close  0.692942
write, close, fsync  0.762524

Compare one o_sync write to two:
one 16k o_sync write 8.494621
two 8k o_sync writes 4.177680

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write   1.836835
(fdatasync unavailable)
write, fsync,1.780872

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write   4.255614
(fdatasync unavailable)
write, fsync,2.120843

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> It's what tested out as the best bet.  I think we were using pgbench
>> as the test platform, which as you know I have doubts about, but at
>> least it is testing one actual write/sync pattern Postgres can generate.

> I assume pgbench has so much variance that trying to see fsync changes
> in there would be hopeless.

The results were fairly reproducible, as I recall; else we'd have looked
for another test method.  You may want to go back and consult the
pghackers archives.

>> * Some of the test cases count open()/close() overhead, some don't.

> The only one I saw that had an extra open() was the fsync after close
> test.  I add a do-nothing open/close to the previous test so they are
> the same.

Why is it sensible to include open/close overhead in the "simple write"
case and not in the "o_sync write" cases, for instance?  Doesn't seem
like a fair comparison to me.  Adding the open overhead to all cases
might make it "fair", but it would also make it not what we want to
measure.

>> * The program is claimed to test whether you can write from one process
>> and fsync from another, but it does no such thing AFAICS.

> It really just shows whether the fsync fater the close has similar
> timing to the one before the close.  That was the best way I could think
> to test it.

Sure, but where's the "separate process" part?  What this seems to test
is whether a single process can sync its own writes through a different
file descriptor; which is interesting but by no means the only thing we
need to be sure of if we want to make the bgwriter handle syncing.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Tom Lane wrote:
> >> As I recall, that was based on testing on some different platforms.
> 
> > But why perfer O_DSYNC over fdatasync if you don't prefer O_SYNC over
> > fsync?
> 
> It's what tested out as the best bet.  I think we were using pgbench
> as the test platform, which as you know I have doubts about, but at
> least it is testing one actual write/sync pattern Postgres can generate.
> The choice between the open flags and fdatasync/fsync depends a whole
> lot on your writing patterns (how much data you tend to write between
> fsync points), so I don't have a lot of faith in randomly-chosen test
> programs as a guide to what to use for Postgres.

I assume pgbench has so much variance that trying to see fsync changes
in there would be hopeless.

> >> What does that mean?  You can't fsync a closed file.
> 
> > You reopen and fsync.
> 
> Um.  I just looked at that test program, and I think it needs a whole
> lot of work yet.
> 
> * Some of the test cases count open()/close() overhead, some don't.
>   This is bad, especially on platforms like Solaris where open() is
>   notoriously expensive.

The only one I saw that had an extra open() was the fsync after close
test.  I add a do-nothing open/close to the previous test so they are
the same.

> * You really cannot put any faith in measuring a single write,
>   especially on a machine that's not *completely* idle otherwise.
>   I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
>   measured the time for that.  (And you have to think about how far

OK, it now measures a loop of 1000.

>   apart the fsyncs are in that sequence; you probably want to repeat the
>   measurement with several different fsync spacings.)  It would also be
>   a good idea to compare writing 1000 successive blocks with rewriting
>   the same block 1000 times --- if the latter does not happen roughly
>   at the disk RPM rate, then we know the drive is lying and all the
>   numbers should be discarded as meaningless.


> 
> * The program is claimed to test whether you can write from one process
>   and fsync from another, but it does no such thing AFAICS.

It really just shows whether the fsync fater the close has similar
timing to the one before the close.  That was the best way I could think
to test it.

> BTW, rather than hard-wiring the test file name, why don't you let it be
> specified on the command line?  That would make it lots easier for
> people to compare the performance of several disk drives, if they have
> 'em.

I have updated the test program in CVS.

New BSD/OS results:

Simple write timing:
write0.034801

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
write, fsync, close  0.868831
write, close, fsync  0.717281

Compare one o_sync write to two:
one 16k o_sync write10.121422
two 8k o_sync writes 4.405151

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write   1.542213
(fdatasync unavailable)
write, fsync,1.703689

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write   4.498607
(fdatasync unavailable)
write, fsync,2.473842

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> As I recall, that was based on testing on some different platforms.

> But why perfer O_DSYNC over fdatasync if you don't prefer O_SYNC over
> fsync?

It's what tested out as the best bet.  I think we were using pgbench
as the test platform, which as you know I have doubts about, but at
least it is testing one actual write/sync pattern Postgres can generate.
The choice between the open flags and fdatasync/fsync depends a whole
lot on your writing patterns (how much data you tend to write between
fsync points), so I don't have a lot of faith in randomly-chosen test
programs as a guide to what to use for Postgres.

>> What does that mean?  You can't fsync a closed file.

> You reopen and fsync.

Um.  I just looked at that test program, and I think it needs a whole
lot of work yet.

* Some of the test cases count open()/close() overhead, some don't.
  This is bad, especially on platforms like Solaris where open() is
  notoriously expensive.

* You really cannot put any faith in measuring a single write,
  especially on a machine that's not *completely* idle otherwise.
  I'd feel somewhat comfortable if you wrote, say, 1000 8K blocks and
  measured the time for that.  (And you have to think about how far
  apart the fsyncs are in that sequence; you probably want to repeat the
  measurement with several different fsync spacings.)  It would also be
  a good idea to compare writing 1000 successive blocks with rewriting
  the same block 1000 times --- if the latter does not happen roughly
  at the disk RPM rate, then we know the drive is lying and all the
  numbers should be discarded as meaningless.

* The program is claimed to test whether you can write from one process
  and fsync from another, but it does no such thing AFAICS.

BTW, rather than hard-wiring the test file name, why don't you let it be
specified on the command line?  That would make it lots easier for
people to compare the performance of several disk drives, if they have
'em.

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian
Kurt Roeckx wrote:
> On Thu, Mar 18, 2004 at 01:50:32PM -0500, Bruce Momjian wrote:
> > > I'm not sure I believe these numbers at all... my experience is that
> > > getting trustworthy disk I/O numbers is *not* easy.
> > 
> > These numbers were reproducable on all the platforms I tested.
> 
> It's not because they are reproducable that they mean anything in
> the real world.

OK, what better test do you suggest?  Right now, there has been no
testing of these.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > I have been poking around with our fsync default options to see if I can
> > improve them.  One issue is that we never default to O_SYNC, but default
> > to O_DSYNC if it exists, which seems strange.
> 
> As I recall, that was based on testing on some different platforms.
> It's not particularly "strange": O_SYNC implies writing at least two
> places on the disk (file and inode).  O_DSYNC or fdatasync should
> theoretically be the fastest alternatives, O_SYNC and fsync the worst.

But why perfer O_DSYNC over fdatasync if you don't prefer O_SYNC over
fsync?

>   
> > Compare fsync before and after write's close:
> > write, fsync, close0.000707
> > write, close, fsync0.000808
> 
> What does that mean?  You can't fsync a closed file.

You reopen and fsync.

> > This shows terrible O_SYNC performance for 2 8k writes, but is faster
> > for a single 8k write.  Strange.
> 
> I'm not sure I believe these numbers at all... my experience is that
> getting trustworthy disk I/O numbers is *not* easy.

These numbers were reproducable on all the platforms I tested.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


[PERFORM] string casting for index usage

2004-03-18 Thread Michael Adler

In porting an application from v7.2 and v7.3, I noticed that a join on a varchar 
column and a text column was ignoring indices that were helpful in v7.2. When I 
explicitly cast the text to a varchar (or set ENABLE_SEQSCAN TO false) the index is 
scanned and it works as efficiently as in v7.2. 

Obviously there were many casting improvements made in 7.3, but our application 
doesn't exactly see it that way. Is explicit casting the only solution (other than 
schema modification)? I haven't found anything in the documentation on this subject.

Thanks,
Mike

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] PostgreSQL Disk Usage and Page Size

2004-03-18 Thread Stephan Szabo
On Thu, 18 Mar 2004, Saleh, Amgad H (Amgad) wrote:

>
> Stephan:
>
> In each table we're storing the max. string length.
>
> For example:
>
> for TEST_1, we're storing 'abcdefghjk'  and 'lmnop'
> for TEST_2, we're storing 'abcdefghjk'  and 'lmnopqrstu'
> for TEST_3, we're storing 'abcdefghjk'  and 'lmnopqrstuvw'

Hmm, on my machine it seemed like I was getting slightly different row
count per page results for the first two cases. The last two aren't going
to be different due to padding if the machine pads to 4 byte boundaries.


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian

I have been poking around with our fsync default options to see if I can
improve them.  One issue is that we never default to O_SYNC, but default
to O_DSYNC if it exists, which seems strange.

What I did was to beef up my test program and get it into CVS for folks
to run.  What I found was that different operating systems have
different optimal defaults.  On BSD/OS and FreeBSD, fdatasync/fsync was
better, but on Linux, O_DSYNC/O_SYNC was faster.

BSD/OS 4.3:
Simple write timing:
write  0.55

Compare fsync before and after write's close:
write, fsync, close0.000707
write, close, fsync0.000808

Compare one o_sync write to two:
one 16k o_sync write   0.009762
two 8k o_sync writes   0.008799

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000658
(fdatasync unavailable)
write, fsync,  0.000702

Compare file sync methods with 2 8k writes:
(The fastest should be used for wal_sync_method)
(o_dsync unavailable)
open o_sync, write 0.010402
(fdatasync unavailable)
write, fsync,  0.001025

This shows terrible O_SYNC performance for 2 8k writes, but is faster
for a single 8k write.  Strange.

FreeBSD 4.9:
Simple write timing:
write  0.83

Compare fsync before and after write's close:
write, fsync, close0.000412
write, close, fsync0.000453

Compare one o_sync write to two:
one 16k o_sync write   0.000409
two 8k o_sync writes   0.000993

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000683
(fdatasync unavailable)
write, fsync,  0.000405

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000789
(fdatasync unavailable)
write, fsync,  0.000414

This shows fsync to be fastest in both cases.

Linux 2.4.9:
Simple write timing:
write  0.61

Compare fsync before and after write's close:
write, fsync, close0.000398
write, close, fsync0.000407

Compare one o_sync write to two:
one 16k o_sync write   0.000570
two 8k o_sync writes   0.000340

Compare file sync methods with one 8k write:
(o_dsync unavailable)
open o_sync, write 0.000166
write, fdatasync   0.000462
write, fsync,  0.000447

Compare file sync methods with 2 8k writes:
(o_dsync unavailable)
open o_sync, write 0.000334
write, fdatasync   0.000445
write, fsync,  0.000447

This shows O_SYNC to be fastest, even for 2 8k writes.

This unapplied patch:

ftp://candle.pha.pa.us/pub/postgresql/mypatches/fsync

adds DEFAULT_OPEN_SYNC to the bsdi/freebsd/linux template files, which
controls the default for those platforms.  Platforms with no template
default to fdatasync/fsync.

Would other users run src/tools/fsync and report their findings so I can
update the template files for their OS's?  This is a process similar to
our thread testing.

Thanks.

---

Bruce Momjian wrote:
> Mark Kirkwood wrote:
> > This is a well-worn thread title - apologies, but these results seemed 
> > interesting, and hopefully useful in the quest to get better performance 
> > on Solaris:
> > 
> > I was curious to see if the rather uninspiring pgbench performance 
> > obtained from a Sun 280R (see General: ATA Disks and RAID controllers 
> > for database servers) could be improved if more time was spent 
> > tuning.
> > 
> > With the help of a fellow workmate who is a bit of a Solaris guy, we 
> > decided to have a go.
> > 
> > The major performance killer appeared to be mounting the filesystem with 
> > the logging option. The next most significant seemed to be the choice of 
> > sync_method for Pg - the default (open_datasync), which we initially 
> > thought should be the best - appears noticeably slower than fdatasync.
> 
> I thought the default was fdatasync, but looking at the code it seems
> the default is open_datasync if O_DSYNC is available.
> 
> I assume the logic is that we usually do only one write() before
> fsync(), so open_datasync should be faster.  Why do we not use O_FSYNC
> over f

Re: [PERFORM] [HACKERS] fsync method checking

2004-03-18 Thread Bruce Momjian

I have updated my program with your suggested changes and put in
src/tools/fsync.  Please see how you like it.

---

Zeugswetter Andreas SB SD wrote:
> 
> > Running the attached test program shows on BSD/OS 4.3:
> > 
> > write  0.000360
> > write & fsync  0.001391
> 
> I think the "write & fsync" pays for the previous "write" test (same filename).
> 
> > write, close & fsync   0.001308
> > open o_fsync, write0.000924
> 
> I have tried to modify the program to more closely resemble WAL 
> writes (all writes to WAL are 8k), the file is usually already open, 
> and test larger (16k) transactions.
> 
> [EMAIL PROTECTED]:~> test_sync1
> write  0.000625
> write & fsync  0.016748
> write & fdatasync  0.006650
> write, close & fsync   0.017084
> write, close & fdatasync   0.006890
> open o_dsync, write0.015997
> open o_dsync, one write0.007128
> 
> For the last line xlog.c would need to be modified, but the measurements
> seem to imply that it is only worth it on platforms that have O_DSYNC
> but not fdatasync.  
> 
> Andreas

Content-Description: test_sync1.c

[ Attachment, skipping... ]

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] PostgreSQL Disk Usage and Page Size

2004-03-18 Thread Stephan Szabo
On Thu, 18 Mar 2004, Saleh, Amgad H (Amgad) wrote:

> Stephan / Stephen
>
> We know about the overhead and do understand the math you've provided.
> This is not the question we're asking. We've just provided the table definitions as
> examples.
>
> The real question was, even with the 52 & 56 (assuming right),' I wouldn't get
> the same number of records per page for all 4k, 8k, 16k, and 32k pages.

On my system, I don't using your tests, IIRC I got 134 with TEST_1 and
like 128 or so on TEST_2 when I used strings of maximum length for the
columns.

>
> To make it more clear to you here's an example:
>
> For an 8k-page: we've got 120 records/page for both tables and other tables such as
>
>  CREATE TABLE TEST_3 (
>  F1 VARCHAR(10),
>  F2 VARCHAR(12) );

Are you storing the same data in all three tables or different data in all
three tables? That's important because there's no difference in length
between varchar(5) and varchar(12) when storing the same 5 character
string.


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html